Co-design for System Acceleration A Quantitative Approach
NADIA NEDJAH Department of Electronics Engineering and Telecommunications, State University of Rio de Janeiro, Brazil
LUIZA DE MACEDO MOURELLE Department of Systems Engineering and Computation, State University of Rio de Janeiro, Brazil
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-13 978-1-4020-5545-4 (HB) ISBN-13 978-1-4020-5546-1 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2007 Springer. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
To my mother and sisters, Nadia
To my father (in memory) and mother, Luiza
Contents
Dedication
List of Figures
List of Tables
Preface
Acknowledgments
1. INTRODUCTION
1.1 Synthesis
1.2 Design Approaches
1.3 Co-Design
1.3.1 Methodology
1.3.2 Simulation
1.3.3 Architecture
1.3.4 Communication
1.4 Structure and Objective
2. THE CO-DESIGN METHODOLOGY
2.1 The Co-Design Approach
2.2 System Specification
2.3 Hardware/Software Partitioning
2.4 Hardware Synthesis
2.4.1 High-Level Synthesis
2.4.2 Implementation Technologies
2.4.3 Synthesis Systems
2.5 Software Compilation
2.6 Interface Synthesis
2.7 System Integration
2.8 Summary
3. THE CO-DESIGN SYSTEM
3.1 Development Route
3.1.1 Hardware/Software Profiling
3.1.2 Hardware/Software Partitioning
3.1.3 Hardware Synthesis
3.1.4 Software Compilation
3.1.5 Run-Time System
3.2 Target Architecture
3.2.1 Microcontroller
3.2.2 Global Memory
3.2.3 Controllers
3.2.4 Bus Interface
3.2.5 The Coprocessor
3.2.6 The Timer
3.3 Performance Results
3.3.1 First Benchmark: PLUM Program
3.3.2 Second Benchmark: EGCHECK Program
3.3.3 Results Analysis
3.4 Summary
4. VHDL MODEL OF THE CO-DESIGN SYSTEM
4.1 Modelling with VHDL
4.1.1 Design Units and Libraries
4.1.2 Entities and Architectures
4.1.3 Hierarchy
4.2 The Main System
4.3 The Microcontroller
4.3.1 Clock and Reset Generator
4.3.2 Sequencer
4.3.3 Bus Arbiter
4.3.4 Memory Read and Write
4.4 The Dynamic Memory: DRAM
4.5 The Coprocessor
4.5.1 Clock Generator
4.5.2 Coprocessor Data Buffers
4.6 Summary
5. SHARED MEMORY CONFIGURATION
5.1 Case Study
5.2 Timing Characteristics
5.2.1 Parameter Passing
5.2.2 Bus Arbitration
5.2.3 Busy-Wait Mechanism
5.2.4 Interrupt Mechanism
5.3 Relating Memory Accesses and Interface Mechanisms
5.3.1 Varying Internal Operations and Memory Accesses
5.3.2 Varying the Coprocessor Memory Access Rate
5.3.3 Varying the Number of Coprocessor Memory Accesses
5.4 Summary
6. DUAL-PORT MEMORY CONFIGURATION
6.1 General Description
6.1.1 Contention Arbitration
6.1.2 Read/Write Operations
6.2 The System Architecture
6.2.1 Dual-Port Memory Model
6.2.2 The Coprocessor
6.2.3 Bus Interface Controller
6.2.4 Coprocessor Memory Controller
6.2.5 The Main Controller
6.3 Timing Characteristics
6.3.1 Interface Mechanisms
6.4 Performance Results
6.4.1 Varying Internal Operations and Memory Accesses
6.4.2 Varying the Memory Access Rate
6.4.3 Varying the Number of Memory Accesses
6.4.4 Speedup Achieved
6.5 Summary
7. CACHE MEMORY CONFIGURATION
7.1 Memory Hierarchy Design
7.1.1 General Principles
7.1.2 Cache Memory
7.2 System Organization
7.2.1 Cache Memory Model
7.2.2 The Coprocessor
7.2.3 Coprocessor Memory Controller
7.2.4 The Bus Interface Controller
7.3 Timing Characteristics
7.3.1 Block Transfer During Handshake Completion
7.4 Performance Results
7.4.1 Varying the Number of Addressed Locations
7.4.2 Varying the Block Size
7.4.3 Varying the Number of Memory Accesses
7.4.4 Speedup Achieved
7.4.5 Miss Rate with Random Address Locations
7.5 Summary
8. ADVANCED TOPICS AND FURTHER RESEARCH
8.1 Conclusions and Achievements
8.2 Advanced Topics and Further Research
8.2.1 Complete VHDL Model
8.2.2 Cost Evaluation
8.2.3 New Configurations
8.2.4 Interface Synthesis
8.2.5 Architecture Synthesis
8.2.6 Framework for co-design
8.2.7 General Formalization
Appendices
A Benchmark Programs
B Top-Level VHDL Model of the Co-design System
C Translating PALASM™ into VHDL
D VHDL Version of the Case Study
References
Index
List of Figures
2.1 The co-design flow
2.2 Typical CLB connections to adjacent lines
2.3 Intel FLEXlogic iFX780 configuration
2.4 Target architecture with parameter memory in the coprocessor
2.5 Target architecture with memory-mapped parameter registers
2.6 Target architecture using a general-purpose processor and ASICs
3.1 Development route
3.2 Hardware synthesis process
3.3 Run-time system
3.4 Run-time system and interfaces
3.5 Target architecture
3.6 Bus interface control register
4.1 Main system configuration
4.2 Coprocessor board components
4.3 Logic symbol for the microcontroller
4.4 VHDL model for the clock and reset generator
4.5 Writing into the coprocessor control register
4.6 The busy-wait model
4.7 The interrupt routine model
4.8 Completing the handshake, by negating N copro st
4.9 Algorithmic state machine for the bus arbiter
4.10 Algorithmic state machine for memory read/write
4.11 Logic symbol for the DRAM 16
4.12 Algorithmic state machine for the DRAM
4.13 Logic symbol of the coprocessor
4.14 Logic symbol of the coprocessor clock generator
4.15 Flowchart for the coprocessor clock generator
4.16 Logic symbol of the coprocessor data buffers
4.17 Flowchart for the coprocessor input buffer control
4.18 Flowchart for the coprocessor output buffer control
5.1 C program of example
5.2 Modified C program
5.3 Passing parameter table to the coprocessor
5.4 Sampling of the bus request input signal (N br)
5.5 Bus arbitration without contention
5.6 Bus arbitration when using busy-wait
5.7 Handshake completion when using busy-wait
5.8 End of coprocessor operation with interrupt
5.9 Handshake completion when using interrupt
5.10 Graphical representation for Tbw and Tint, in terms of iterations
5.11 Graphical representation for Tb and Ti, in terms of accesses
5.12 Relation between N memf in and mem f in, for busy-wait
6.1 Logic symbol of the dual-port memory
6.2 Arbitration logic
6.3 Main system architecture
6.4 Coprocessor board for the dual-port configuration
6.5 Logic symbol of DRAM16
6.6 State machine for the dual-port memory model
6.7 Logic symbol of the coprocessor for the dual-port configuration
6.8 Logic symbol of the bus interface controller
6.9 Logic symbol of the coprocessor memory controller
6.10 State machine of the coprocessor memory accesses controller
6.11 State machine of the DRAM controller
6.12 Logic symbol of the main controller
6.13 Chart for Tb and Ti, in terms of iterations
6.14 Coprocessor memory access
6.15 Synchronization between memory controller and accelerator
7.1 Addressing different memory levels
7.2 Block placement policies, with 8 cache blocks and 32 main memory blocks
7.3 Coprocessor board for the cache memory configuration
7.4 Logic symbol of the cache memory model
7.5 Flowchart for the cache memory model
7.6 Logic symbol of the coprocessor for the cache configuration
7.7 Logic symbol for the coprocessor memory controller
7.8 Virtual address for the cache memory configuration
7.9 Cache directory
7.10 Algorithmic state machine for the coprocessor memory controller
7.11 Logic symbol of the bus interface controller for the cache configuration
7.12 Algorithmic state machine for the block transfer
7.13 Algorithmic state machine for the block updating
7.14 Coprocessor cache memory write
7.15 Bus arbitration when there is a cache miss
7.16 Transferring a word from the cache to the main memory
7.17 End of the block transfer during a cache miss
7.18 End of coprocessor operation and beginning of block update
7.19 End of transfer of block 0
7.20 Handshake completion after block updating
7.21 C program example, with random address locations
List of Tables
2.1 Some characteristics of the XC4000 family of FPGAs
3.1 Performance results for PLUM
3.2 Performance results for EGCHECK
5.1 Performing 10 internal operations and a single memory access per iteration
5.2 Performing 100 internal operations and one memory access per iteration
5.3 Performing 10 iterations and a single memory access per iteration
5.4 Performing 10 iterations and 10 internal operations
5.5 Performing 10 internal operations and no memory accesses
5.6 Performing 10 internal operations and 2 memory accesses per iteration
5.7 Performing 10 internal operations and 3 memory accesses per iteration
6.1 Performing 10 internal operations and a single memory access per iteration
6.2 Performing 10 internal operations and a single memory access per iteration
6.3 Performing 10 iterations and a single memory access per iteration
6.4 Performing 10 iterations and 10 internal operations
6.5 Performing 10 internal operations and 2 memory accesses per iteration
7.1 Performing 10 operations and 1 memory access per iteration, with BS = 512 bytes
7.2 Performing 10 operations and 1 memory access per iteration, with BS = 256 bytes
7.3 Performing 10 operations and 1 memory access per iteration, with BS = 128 bytes
7.4 Performing 10 iterations and 10 internal operations, with BS = 512 bytes
7.5 Performing 10 internal operations and 1 memory access per iteration, with BS = 512 bytes and random address locations
7.6 Performing 10 operations and 1 memory access per iteration, with BS = 128 bytes and random address locations
Preface
In this book, we study the co-design methodology in general and, in particular, how to determine the most suitable interface mechanism in a co-design system. The choice is based on the characteristics of the application and those of the target architecture of the system. We provide guidelines to support the designer’s choice of the interface mechanism. The content of this book is divided into 8 chapters, which are described in the following.

In Chapter 2, we present co-design as a methodology for the integrated design of systems implemented using both hardware and software components. This includes high-level synthesis and the new technologies available for its implementation. Recent work in the co-design area is introduced.

In Chapter 3, the physical co-design system developed at UMIST is presented. The development route adopted is discussed and the target architecture described. Performance results are then presented, based on the experimental results obtained. The relation between the execution times and the interface mechanisms is analysed.

In Chapter 4, we develop a VHDL model of our co-design system, in order to investigate its performance for different characteristics of the application and of the architecture.

In Chapter 5, a case study example is presented, on which all the subsequent analysis is carried out. The timing characteristics of the system are introduced, that is, the times for parameter passing and bus arbitration for each interface mechanism, together with their handshake completion times. The relation between the coprocessor memory accesses and the interface mechanisms is then studied.

In Chapter 6, a dual-port shared memory configuration is introduced, replacing the single-port shared memory of the original configuration. This new configuration aims to reduce the occurrence of bus contention, naturally present in a shared bus architecture. Our objective is to identify performance improvements due to the substitution of the single-port shared memory by the dual-port shared memory.

In Chapter 7, a cache memory for the coprocessor is introduced into the original single-port shared memory configuration. This is an alternative to the dual-port memory, allowing us to reduce bus contention while keeping the original shared memory configuration. The identification of performance improvements, due to the inclusion of a cache memory for the coprocessor in the original implementation, is then carried out.

In Chapter 8, we describe new trends in co-design and software acceleration.

N. Nedjah and L. M. Mourelle
Acknowledgments
We are grateful to FAPERJ (Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, http://www.faperj.br) and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, http://www.cnpq.br) for their continuous financial support.
Chapter 1 INTRODUCTION
In a digital system, hardware is usually considered to be those parts of the system implemented using electronic components, such as processors, registers, logic gates, memories and drivers. Software is thought of as the sub-systems implemented as programs stored in memory as a sequence of bits, which are read and executed by a processor. In a traditional design strategy, the hardware and software partitioning decisions are fixed at an early stage in the development cycle and both designs evolve separately (Kalavade and Lee, 1992; Kalavade and Lee, 1993; N. S. Woo and Wolf, 1994). Certain operations are clearly implemented by hardware, such as high-speed data packet manipulation; others by software, such as recursive search of a tree data structure; and there is usually a collection of further operations that can be implemented either by hardware or by software. The decision to implement an operation in hardware or software is based on the available technology, cost, size, maintainability, flexibility and, probably most importantly, performance.

Advances in microelectronics have offered us large systems containing a variety of interacting components, such as general-purpose processors, communication sub-systems, special-purpose processors (e.g., Digital Signal Processors, DSPs), micro-programmed special-purpose architectures, off-the-shelf electronic components, logic-array structures (e.g., Field Programmable Gate Arrays, FPGAs) and custom logic devices (e.g., Application Specific Integrated Circuits, ASICs) (Subrahmanyam, 1992; Subrahmanyam, 1993; Wolf, 2004). Today’s products contain embedded systems implemented with hardware controlled by a large amount of software (G. Borriello, 1993). These systems have processors dedicated to specific functions (e.g., microcontrollers) and different degrees of programmability (Micheli, 1993), namely at the application, instruction or hardware levels. At the application level, the system runs dedicated software programs that allow the user to specify the desired functionality using a specialized language. At the instruction level, programming is achieved by executing, on the hardware, the instructions supported by the architecture. Hardware-level programming means configuring the hardware, after manufacturing, in the desired way.

There have also been advances in some of the technologies related to system design, such as logic synthesis and system-level simulation environments, together with formal methods for design specification, design and verification (Subrahmanyam, 1992; Subrahmanyam, 1993).

In this chapter, we discuss the concept of synthesis and design approaches. The co-design concept is then introduced, together with a methodology for its application. Finally, the objective of this book is presented, together with its structure.
1.1 Synthesis
Synthesis is the automatic translation of a design description from one level of abstraction to a more detailed, lower level of abstraction. A behavioral description defines the mapping of a system’s inputs to its outputs and a structural description indicates the set of interconnected components that realize the required system behavior. The synthesis process offers a reduction in the overall design time and cost of a product, together with a guarantee of the circuit correctness. However, the synthesized design may not be as good as one produced manually by an experienced designer.

Architectural synthesis allows the search for the optimal architectural solution relative to the specified constraints (E. Martin and Philippe, 1993). High-level synthesis is the process of deriving hardware implementations for circuits from high-level programming languages or other high-level specifications (Amon and Borriello, 1991). In order to have a behavioral description of the system, we use a Hardware Description Language (HDL), which offers the syntax and semantics needed. Examples of suitable HDLs are the Specification and Description Language (SDL) (Rossel and Kruse, 1993; A.A. Jerraya and Ismail, 1993), the Very High Speed Integrated Circuits Hardware Description Language (VHDL) (Ecker, 1993a; Ecker, 1993b; Navabi, 1998) and the high-level description language HardwareC (Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993; Ku and Micheli, 1990).

Instead of a hardware description language, it is possible to use a high-level programming language, such as C or C++, to describe the system behavior, making the translation to a hardware description language at a later stage of the design process (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b; P. Pochmuller and Longsen, 1993). Using high-level synthesis techniques, we can synthesize digital circuits from a high-level specification (Thomas, 1990; D. E. Thomas and Schmit, 1993) which have the performance advantages offered by customizing the architecture to the algorithm (Srivastava and Brodersen, 1991; M. B. Srivastava and Brodersen, 1992). Nevertheless, as the number of gates increases, the cost and the turnaround time, i.e., the time taken from the design specification until its physical implementation, also increase, and for large system designs, synthesized hardware solutions tend to be expensive (Gupta, 1993; Gupta and Micheli, 1993).

Despite the increase in power and flexibility achieved by new processors, users are always asking for something more. In order to satisfy this demand, it would be necessary to add special-purpose modules to a processor system. However, this would constrain the system, and the usually small market generated for a specialized function discourages companies from developing hardware to satisfy such needs. A solution to this problem would be the use of a reconfigurable system, based on, say, Field-Programmable Gate Arrays (FPGAs), which could be attached to a standard computer, for performing functions normally implemented by special-purpose cards (D. E. Van den Bout, 1992). This kind of implementation offers faster execution, low cost and low power, since the necessary logic is embedded in the ASIC component.
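The distinction drawn at the start of this section between a behavioral and a structural description can be illustrated with a small sketch. The example below is not taken from the book: it uses a one-bit full adder, chosen here purely for illustration, and all function names are hypothetical. The behavioral version states only the input/output mapping, whereas the structural version wires together primitive gates; an exhaustive check confirms that both describe the same behavior.

```python
from itertools import product
from typing import Tuple


def full_adder_behavioral(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Behavioral view: only the mapping from inputs to outputs is stated."""
    total = a + b + cin
    return total & 1, total >> 1  # (sum bit, carry-out bit)


# Primitive components used by the structural view (illustrative names).
def xor_gate(x: int, y: int) -> int:
    return x ^ y


def and_gate(x: int, y: int) -> int:
    return x & y


def or_gate(x: int, y: int) -> int:
    return x | y


def full_adder_structural(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Structural view: a set of interconnected gates realizing the behavior."""
    n1 = xor_gate(a, b)       # internal net 1
    s = xor_gate(n1, cin)     # sum output
    n2 = and_gate(a, b)       # internal net 2
    n3 = and_gate(n1, cin)    # internal net 3
    cout = or_gate(n2, n3)    # carry output
    return s, cout


def descriptions_agree() -> bool:
    """Compare both descriptions over every input combination."""
    return all(
        full_adder_behavioral(*bits) == full_adder_structural(*bits)
        for bits in product((0, 1), repeat=3)
    )
```

A synthesis tool performs the step from the first form to the second automatically; here the structural form was written by hand and merely checked against the behavioral one.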
1.2 Design Approaches
A behavioral description of a system implemented as a program running on a processor gives a low-cost and flexible solution. However, pure software implementations of a system are often too slow to meet all of the performance constraints, which can be defined for the overall time (latency) to perform a given task or to achieve predetermined input/output rates (R. K. Gupta and Micheli, 1992b). Depending on these constraints, it may be better to have all or part of the behavioral description of the system implemented as a hardware circuit. In this kind of implementation, hardware modifications cannot be performed as dynamically as in a software implementation and the relative cost is higher, since the solution is customized.

System designers are faced with a major decision: given a behavioral description of a system and a set of performance constraints, which kind of implementation is best, a software or a hardware solution? Cost-effective designs use a mixture of hardware and software to accomplish their overall goals (Gupta, 1993; Gupta and Micheli, 1993). Dedicated systems, with hardware and software tailored for the application, normally provide performance improvements over systems based on general-purpose hardware (Srivastava and Brodersen, 1991; M. B. Srivastava and Brodersen, 1992). Mixed system designs, using ASICs, memory, processors and other special-purpose modules, reduce the size of the synthesis task by minimizing the number of application-specific chips required, while, at the same time, achieving the flexibility of software reprogramming to alter system behavior. However, the problem is usually more complex, since the software on the processor implements system functionality in an instruction-driven manner with a statically allocated memory space, whereas ASICs operate as data-driven, reactive elements (Chiodo and Sangiovanni-Vincentelli, 1992; Gupta and Micheli, 1992). In addition, further problems must be solved (R. K. Gupta and Micheli, 1992a; Gupta and Micheli, 1992), such as: modeling system functionality and constraints; determining the boundaries between hardware and software components in the system model; specifying and synthesizing the hardware-software interface; implementing hardware and software components.
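The decision discussed above, choosing between software and hardware implementations under a latency constraint, can be sketched as a small search problem. The code below is only an illustration under strong assumptions that are not in the text: operations execute one after another so their times simply add, communication overhead is ignored, software has no extra cost, and the operation names and all numbers are hypothetical.

```python
from itertools import product

# Hypothetical estimates per operation:
# (software time, hardware time, hardware cost), in arbitrary units.
OPERATIONS = {
    "packet_filter": (40, 5, 8),
    "tree_search": (10, 9, 6),
    "checksum": (30, 10, 3),
}


def cheapest_partition(ops, max_latency):
    """Try every hardware/software assignment and keep the cheapest one
    whose total latency meets the constraint. Returns None if no
    assignment satisfies the constraint."""
    names = list(ops)
    best = None
    for choice in product(("sw", "hw"), repeat=len(names)):
        latency = sum(
            ops[name][0] if side == "sw" else ops[name][1]
            for name, side in zip(names, choice)
        )
        cost = sum(ops[name][2] for name, side in zip(names, choice) if side == "hw")
        if latency <= max_latency and (best is None or cost < best[0]):
            best = (cost, latency, dict(zip(names, choice)))
    return best
```

With these made-up figures and `max_latency=50`, only `packet_filter` moves to hardware, since that single move already meets the constraint at the lowest cost. Exhaustive search is practical only for a handful of operations; it is used here because it makes the trade-off explicit, not because it reflects how real partitioning tools work.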
1.3 Co-Design
Co-design refers to the integrated design of systems implemented using both hardware and software components (Subrahmanyam, 1992; Subrahmanyam, 1993), given a set of performance goals and an implementation technology. In this sense, it is not a new approach to designing systems. What is new is the requirement for systematic, scientific co-design methods (Wolf, 1993). The co-design problem entails characterizing hardware and software performance, identifying a hardware-software partition, transforming the functional description of a system into such a partition and synthesizing the resulting hardware and software (D. E. Thomas and Schmit, 1993).

From a behavioral description of the system, that is, an implementation-independent one, a partitioning into hardware and software components can be performed, based on the imposed performance constraints. Hardware solutions may provide higher performance by supporting the parallel execution of operations, but there is the inherent cost of fabricating one or more ASICs. On the other hand, software solutions will run on high-performance processors, which are available at lower cost due to high-volume production. However, the serialization of operations and the lack of specific support for some tasks can decrease performance (Micheli, 1993).

Designers might start with an all-software implementation, in which case it is said to be a software-oriented approach (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b; Ernst and Henkel, 1992; R. Ernst and Benner, 1993), and check the implementation’s functionality. Later on, they might refine the design over time to get a mixed hardware-software design (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b; N. S. Woo and Wolf, 1994; N. S. Woo and Dunlop, 1992), in which ASIC components are chosen to complement performance or add functionality not achievable by pure program implementations alone (Gupta and Micheli, 1992), such as floating-point operations. On the other hand, it is possible to start with a hardware implementation, called a hardware-oriented approach (R. K. Gupta and Micheli, 1992b), and try, gradually, to move hardware functions to software, taking into account timing constraints and synchronization requirements.

A good design trade-off is to speed up the events that happen most frequently, since the impact of making some occurrence faster is higher if the occurrence is frequent. Therefore, the point now is to decide what the frequent case is and how much performance can be improved by making that case faster. We will return to this subject in Chapter 3, when the co-design system developed at UMIST is discussed.
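The closing remark about improving the frequent case is commonly quantified with Amdahl's law, which the text does not name explicitly; it is brought in here as an illustration. If a fraction f of the execution time is accelerated by a factor s, the overall speedup is 1 / ((1 - f) + f / s). The sketch below uses a hypothetical function name and made-up figures.

```python
def overall_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when a fraction of the execution time
    is made `local_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must lie between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - fraction_accelerated          # part left untouched
    accelerated = fraction_accelerated / local_speedup  # part moved to faster logic
    return 1.0 / (remaining + accelerated)


# The same tenfold accelerator applied to a frequent and to a rare case.
frequent_case = overall_speedup(0.8, 10.0)  # about 3.57
rare_case = overall_speedup(0.2, 10.0)      # about 1.22
```

The comparison shows why identifying the frequent case comes first: the same tenfold local gain yields roughly 3.6 times overall when it covers 80% of the run time, but only about 1.2 times when it covers 20%.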
1.3.1 Methodology
In recent years, several integrated CAD environments for the automatic generation of ASICs have been developed, which have resulted in a reduction in the overall design time of a system (Srivastava and Brodersen, 1991). On the other hand, computer aids are available to assist with the structured design of software systems. In practice, we may want to (Chiodo and Sangiovanni-Vincentelli, 1992): formally verify a design in order to check whether the system satisfies a number of properties that specify its correctness; simulate to check whether the system responds correctly to the stimuli that the environment is supposed to produce; automatically synthesize an implementation consistent with user-defined criteria, such as speed and area.

The chosen synthesis methodology may be for general-purpose applications (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b) or for domain-specific applications, such as DSPs (Kalavade and Lee, 1992; Kalavade and Lee, 1993) or ASIPs (A. Alomary, 1993). Some tools are available using high-level synthesis techniques to generate a purely hardware implementation of a system, such as CATHEDRAL II (J. Rabaey, 1988), OLYMPUS (G. De Micheli and Truong, 1990), System Architecture’s Workbench (Thomas, 1990) and CONES (Stroud, 1992).

A framework for co-design means a methodology along with a complementary set of tools for specification, development, simulation/prototyping and testing (Subrahmanyam, 1992; Subrahmanyam, 1993). It must provide estimates of the performance metrics, such as size, power, cost, maintainability, flexibility and modifiability. In (Kalavade and Lee, 1992; Kalavade and Lee, 1993), a framework called PTOLEMY is used for simulating and prototyping heterogeneous systems, while in (Buchenrieder and Veith, 1992; Buchenrieder, 1993; Buchenrieder and Veith, 1994) another framework called CODES is used as an environment for concurrent system design.
1.3.2 Simulation
Another interesting problem is related to the simulation of hardware, taking into account the existence of the associated software. This is known as co-simulation (D. Becker and Tell, 1992; D. E. Thomas and Schmit, 1993). In this case, both software and hardware are expected to be developed in parallel and their interaction is analyzed, using the simulation of the hardware components. POSEIDON (R. K. Gupta and Micheli, 1992a; Gupta and Micheli, 1992) is a tool that allows for the simulation of multiple functional modules implemented either as a program or as behavioral or structural hardware models.
1.3.3 Architecture
In a co-design system, the target system architecture usually consists of a software component, as a program running on a re-programmable processor, assisted by application-specific hardware components (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b; Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993). Some aspects that lead to this kind of implementation are (Srivastava and Brodersen, 1991; Subrahmanyam, 1993): the increasing diversity and complexity of applications employing embedded systems (D. D. Gajski and Gong, 1994); the need for decreasing the cost of designing and testing such systems; advances in some of the key enabling technologies, such as logic synthesis and formal methods.
1.3.4 Communication
From the hardware/software co-design point of view, it is important to ensure proper communication between the hardware and software sub-systems. If we take a closer look at software running on a processor, we recognize that, at the lower levels, hardware/software communication is nothing but hardware/hardware communication (Monjau and Bunchenrieder, 1993). Thus, if we are able to map the abstract communication to distinct protocols for the hardware parts, we can also do this for the communication between the hardware and software parts. From a software perspective, communication with the hardware processes of the system has to be performed via some special device driver software and associated system calls provided by the processor’s operating system (Rossel and Kruse, 1993). Due to the parallelism associated with embedded systems, they can be expressed as a set of concurrent sequential processes communicating via message queues (Monjau and Bunchenrieder, 1993; M. B. Srivastava and Brodersen, 1992). Amon (Amon and Borriello, 1991) discusses the problem of sizing synchronization queues, in order to implement constructs such as send and receive.
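The model of concurrent sequential processes communicating via message queues, with send and receive constructs, can be sketched with two threads and two bounded queues. This is only an illustration using Python's standard `queue` and `threading` modules: the process names, the squaring operation standing in for the hardware function, and the queue size are all hypothetical and do not come from the cited work.

```python
import queue
import threading


def run_exchange(values, queue_size=2):
    """Send each value to a 'hardware' process and collect its replies.
    Both queues are bounded, so a send blocks while its queue is full."""
    to_hw = queue.Queue(maxsize=queue_size)  # software -> hardware messages
    to_sw = queue.Queue(maxsize=queue_size)  # hardware -> software messages
    stop = object()                          # end-of-stream marker

    def hardware_process():
        while True:
            item = to_hw.get()               # receive
            if item is stop:
                to_sw.put(stop)              # propagate end of stream
                return
            to_sw.put(item * item)           # send the (stand-in) result

    def software_process():
        for value in values:
            to_hw.put(value)                 # send; blocks if the queue is full
        to_hw.put(stop)

    hw = threading.Thread(target=hardware_process)
    sw = threading.Thread(target=software_process)
    hw.start()
    sw.start()

    results = []
    while True:                              # the caller drains the reply queue
        item = to_sw.get()
        if item is stop:
            break
        results.append(item)

    sw.join()
    hw.join()
    return results
```

Calling `run_exchange([1, 2, 3, 4, 5])` returns the squares in the order sent, because each queue is first-in, first-out. The `queue_size` parameter is where the sizing question arises: a smaller queue makes the sender block more often, while a larger one consumes more storage.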
1.4 Structure and Objective
In the rest of the book, we present co-design as a methodology for the integrated design of systems implemented using both hardware and software components. The initial system specification is partitioned into hardware and software sub-systems that will have to communicate during the execution of the application. The objective of this research is to analyze the behavior of the co-design system, based on different interface mechanisms between the hardware and software sub-systems. As a result of this analysis, we provide some guidelines to support the designer in the choice of the most suitable interface, according to the characteristics of the application and of the co-design system under consideration.

In Chapter 2, the co-design methodology is discussed in more detail. High-level synthesis is also presented, together with the new technologies available for its implementation. Along with the discussion, recent work in the co-design area is introduced as examples.

In Chapter 3, the physical co-design system developed at UMIST is discussed. The development route is outlined and the target architecture is described. Performance results are then presented, based on the experimental results obtained. The relation between the execution times and the interface mechanisms employed is then discussed.

In Chapter 4, we present the VHDL model of our co-design system. Since we already have a physical implementation of the system, the procedure adopted for modeling it is explained. Most of the components will be manually translated from their previous description into VHDL, in a quite straightforward process. Others, such as the global memory, will have to be modeled based on their behavior and how they interface with the rest of the system.

In Chapter 5, the simulation of the VHDL model of the co-design system is described. A case study example is presented, on which all the subsequent analysis will be carried out. The timing characteristics of the system are introduced, that is, the times for parameter passing and bus arbitration for each interface mechanism, together with their handshake completion times. The relation between the coprocessor memory accesses and the interface mechanisms is then analyzed.

In Chapter 6, we continue the simulation process, based on a dual-port shared memory configuration. The new system’s architecture is presented, together with the necessary modifications to the original VHDL models and the derivation of new models. Once again, the timing characteristics of the new system are introduced. Performance results are obtained based on the same case study used in the previous chapter. Our aim is to identify any performance improvement due to the substitution of the single-port shared memory, in the original implementation, by the dual-port shared memory.

In Chapter 7, a cache memory for the coprocessor is introduced into the original single-port shared memory configuration. The concept of memory hierarchy design is discussed, in order to provide the necessary background for the subsequent analysis. The new organization of the system is presented, together with the necessary modifications to the existing VHDL models and the derivation of new modules. Continuing our study of the system’s performance, the timing characteristics are then presented. Performance results are obtained based on the same case study used in the two previous chapters. The identification of performance improvements, due to the inclusion of a cache memory for the coprocessor in the original implementation, is carried out.

In Chapter 8, we present some conclusions based on this research and outline some ideas for future work. A comparison between the single shared memory configuration, the dual-port shared memory configuration and the coprocessor cache configuration is undertaken, based on the simulation results obtained.
Chapter 2 THE CO-DESIGN METHODOLOGY
Computing systems are becoming increasingly complex and often contain large amounts of both hardware and software. The need for methodological support in managing the development process of such systems, therefore, becomes more urgent. The user’s requirements are partitioned into functional and nonfunctional subsets, from which a functional view and an architectural view, i.e., the environment the system has to function in, can be derived. The subject of computing is then split between hardware and software at an early stage. The hardware development will be based on cost and performance, whilst the software development will be guided by the functional requirements of the system. Each development process will have its own models in its own environment. Problems discovered during integration and test, but after hardware fabrication, cause projects to run over budget, behind schedule and result in systems that may not fully satisfy user needs (Kuttner, 1996). Hence, it is important to produce systems that are “right first time”.

The design of heterogeneous hardware/software systems is driven by a desire to maximize performance and minimize cost. Therefore, it is necessary to use an integrated design environment, which is capable of investigating and modeling the performance and process functionality of the hardware/software system (Cooling, 1995).

The design process can be defined as a sequence of transformations that begins with a system specification and leads to an implementation of the system. The steps in the sequence are defined by system models in decreasing levels of abstraction and, at every step in the sequence, an input model is transformed into an output model (Monjau and Bunchenrieder, 1993).

This chapter presents the co-design methodology and recent developments in this area. High-level synthesis is also discussed as it is central to the implementation of mixed hardware/software systems. New implementation technologies are presented followed by a review of the available synthesis tools.
2.1 The Co-Design Approach
Co-design refers to a methodology for the integrated design of systems implemented using both hardware and software components (Subrahmanyam, 1993), given a set of performance goals and an implementation technology. In this way, it is not a new approach. What is new is the requirement for systematic, scientific co-design methods (Wolf, 1993). The co-design problem entails (D. E. Thomas and Schmit, 1993): characterizing hardware and software performance; identifying a hardware/software partition; transforming the functional description into such a partition; synthesizing the resulting hardware and software.

From a behavioral description of the system that is implementation independent, a partitioning can be made into hardware and software components, based on imposed performance constraints (R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; Gupta and Micheli, 1992), cost, maintainability, flexibility and area, in terms of logic and space requirements (N. S. Woo and Wolf, 1994). Hardware solutions may provide higher performance by supporting the parallel execution of operations, but there is the cost of fabricating one or more ASICs. On the other hand, software solutions will run on high-performance processors available at low cost due to high-volume production. However, the serialization of operations and the lack of specific support for some tasks can decrease performance (Micheli, 1993).

A framework for co-design means a methodology along with a complementary set of tools for the specification, development, simulation/prototyping and testing of systems. It may be suitable for general-purpose applications (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b) or for a specific domain (Kalavade and Lee, 1992; Kalavade and Lee, 1993), but, in general, the co-design methodology consists of the steps presented in Figure 2.1. The following sections introduce each step.
An example of a framework for co-design is PTOLEMY (J. Buck and Messerschmitt, 1990; University of California and Science, 1992), a software environment for the simulation and prototyping of heterogeneous systems. It uses object-oriented software technology to model sub-systems using different techniques and has mechanisms to integrate these sub-systems into a complete model.
Figure 2.1. The co-design flow
Another example of a framework for co-design is POLIS (F. Balarin, 1997), which makes use of a methodology for specification, automatic synthesis and validation of embedded systems (D. D. Gajski and Gong, 1994). Design is done in a unified framework, with a unified hardware/software representation, so as to prejudice neither hardware nor software implementation (Wolf, 2002). This model is maintained throughout the design process, in order to preserve the formal properties of the design. POLIS uses the PTOLEMY environment to perform co-simulation. Partitioning is done using this co-simulation, thus allowing a tight interaction between architectural choices and performance/cost analysis.

The COSYMA system (Ernst and Henkel, 1992; R. Ernst and Benner, 1993) is a platform for co-synthesis of embedded architectures. It follows a software-oriented co-synthesis approach, in which as many operations as possible are implemented in software. External hardware is generated only when timing constraints are violated. COSYMA uses the OLYMPUS (G. De Micheli and Truong, 1990) high-level synthesis tool for hardware synthesis and simulation.
2.2 System Specification
For system specification, we can use a hardware description language, such as VHDL (Ecker, 1993a; Ecker, 1993b; Wendling and Rosenstiel, 1994). VHDL was developed as a general design tool to overcome the problem of a design becoming target dependent. In this way, a design may be created in VHDL and the target technology chosen after the fact. The design may then be synthesized into a target technology by the application of synthesis tools. Once the technology changes, it will only be necessary to re-synthesize the design, instead of carrying out a complete logic redesign. The resulting benefit to the user is a dramatic reduction in time-to-market by saving on learning curve time, redesign time and design entry time.

Another possibility is to use a high-level description language, such as HardwareC (Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993), for the system specification. The HardwareC language (Ku and Micheli, 1990) has a C-like syntax and supports timing and resource constraints. This language supports the specification of unbounded and unknown delay operations that can arise from data-dependent decisions and external synchronization operations.

A high-level programming language, such as C or C++, can also be used for system specification (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b). In this case, we can say that the design follows a software-oriented approach, in contrast to the hardware-oriented approach provided by VHDL and HardwareC. The translation to a hardware description language is made at a later stage in the design process.

There are some languages used for specific applications, such as the Specification and Description Language (SDL) (Rossel and Kruse, 1993), for communication systems, and PROMELA (A. S. Wenban and Brown, 1992), a high-level concurrent programming language used for communication protocol specifications. In (N. S. Woo and Wolf, 1994; N. S. Woo and Dunlop, 1992), an Object-Oriented Functional Specification language (OOFS) is used, in order to avoid biasing the initial specification to hardware or software.

When using PTOLEMY (Kalavade and Lee, 1992; Kalavade and Lee, 1993), the specification of the system is described in domains, each of which provides a different computational model, such as the Synchronous Data Flow (SDF) domain, for filters and signal generators, and the digital-hardware modeling (Thor) domain, for digital hardware.
2.3 Hardware/Software Partitioning
Given some basis for evaluating system performance, it is possible to decide which tasks should be implemented as hardware and which as software. If a task interacts closely with the operating system, software may be the only feasible implementation. Likewise, if a task interacts closely with external signals, implementing it in hardware may be the only practical solution. We can determine which to pursue according to the following criteria:

dynamic properties of the system: a characterization of how the execution time of a task impacts system performance;
static properties of the task: the difference in execution times between hardware and software implementations of the task;
hardware costs: the amount of custom hardware required to realize a hardware implementation of the task.

The first consideration takes into account how the system performance depends on the execution time of each task, which in turn depends on the criterion by which system performance is measured. In the second case, some tasks are inherently much better suited for hardware implementation than others. To quantify these differences, we must identify properties of a task’s behavior that indicate how software and hardware implementations of the task will perform. In considering the amount of custom hardware necessary to reduce a task’s execution time, we must see that, for some tasks, custom hardware implementations might perform well, but be impractical due to high gate counts or memory requirements. For others, there may be a range of achievable performance gains, depending on how much of the custom hardware is devoted to the task.

In the case of embedded systems (D. D. Gajski and Gong, 1994; Wolf, 2002), a hardware/software partition represents a physical partition of the system functionality into application-specific hardware (coprocessor) and software executing on one or more processors. The hardware/software interface strongly affects partitioning. The aim in this kind of architecture is to improve the system performance. In this sense, partitioning seeks to maximize the overall speedup for a given application. The speedup can be estimated by a profiling analysis that takes into account typical data sets over which the application behavior is estimated. Due to this data dependence, in some application areas the speedup may not be a well-defined or useful metric, particularly in those with real-time response requirements. In such cases, the size of the implementation and timing constraint satisfaction are used to drive the partitioning decision. Some partitioning strategies are discussed in (Micheli and Gupta, 1997).

In (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b), from a system specification written in C, partitioning is based on the identification of performance-critical regions, using an interactive profiling tool. The partitioning tool (Wright, 1997) then translates the C code for the body of a critical region to a hardware description language representation, in order to enable high-level hardware synthesis. The identified C source code for a critical region is adapted to implement a “hardware” call/return mechanism, which invokes the associated operations in the hardware component (see Section 3.1.2).

Another method implements partitioning based on the fact that some operations are best performed by hardware, others by software and some either by hardware or by software (N. S. Woo and Wolf, 1994; N. S. Woo and Dunlop, 1992). This yields three groups of operations: hardware (H), software (S) and co-design (C). Each object and operation in the S group is defined in the C++ language. Each object and operation in the H group is defined in C++ in order to emulate the object and its operation. These C++ programs will be used for a simulation of the whole system. Each object and operation in the C group is defined in OOFS (Object-Oriented Functional Specification). The system designers indicate whether each specification in the last group will be implemented by software or by hardware, which will define the translation from the co-specification language to C++ or BESTMAP-C (Laboratories, 1992), a hardware description language, respectively.

The HardwareC description used in (Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993) is compiled into a system graph model based on data-flow graphs. This graph consists of vertices representing operations and edges which represent serialization among operations. Overall, the system graph model is composed of concurrent data flow sections, which are ordered by the system control flow. Considering an initial solution in hardware, some operations are selected for moving into software, based on a cost criterion of communication overheads. With the serialization of the operations and analysis of the corresponding assembly code, delays through the software component can be derived. The movement of operations to software is then constrained by the satisfaction of timing constraints. As a second approach to system partitioning, the effect of non-determinism is taken into account, which is caused either by external synchronization operations (send, receive) or by internal data-dependent delay operations (loops, conditionals). System partitioning is performed by decoupling the external and internal points of non-determinism in the system model.
The external non-deterministic points are implemented using application-specific hardware and the internal non-deterministic points are implemented in software. If the initial partition is feasible, then it is refined by migrating operations from hardware to software, in search of a lower-cost feasible partition. The partition is indicated by “tagging” the system graph’s vertices to be either hardware or software.

In (Ernst and Henkel, 1992; R. Ernst and Benner, 1993), as many operations as possible are implemented in software, using a subset of the C language, called C^x. External hardware is only generated when timing constraints are violated or in the case of I/O functions. The C^x system description is parsed into a syntax graph CGL including all constraints, on which partitioning is performed. Those statements that are to be implemented in software are then translated to regular C. The original program structure is kept throughout the partitioning process.
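The hardware-first refinement described in this section, which starts from a feasible all-hardware partition and migrates operations to software while the timing constraint still holds, can be sketched as follows. This is a deliberately simplified greedy illustration, not the algorithm of the cited work: it assumes purely sequential execution, ignores communication overheads, and the `Op` fields, the function names and the "most expensive hardware first" ordering are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    hw_time: int  # execution time in hardware (arbitrary units)
    sw_time: int  # execution time in software (same units)
    hw_cost: int  # hardware cost, e.g. an approximate gate count


def total_time(ops, in_software):
    """Sequential execution time for a given hardware/software tagging."""
    return sum(op.sw_time if op.name in in_software else op.hw_time for op in ops)


def migrate_to_software(ops, deadline):
    """Greedily tag operations as software while the deadline is still met."""
    in_software = set()
    if total_time(ops, in_software) > deadline:
        raise ValueError("initial all-hardware partition is infeasible")
    # Try to free the most expensive hardware first.
    for op in sorted(ops, key=lambda o: o.hw_cost, reverse=True):
        if total_time(ops, in_software | {op.name}) <= deadline:
            in_software.add(op.name)
    remaining_cost = sum(op.hw_cost for op in ops if op.name not in in_software)
    return in_software, remaining_cost
```

For instance, with operations `filter` (2/10/500), `control` (1/2/300) and `io` (1/6/100), given as hardware time/software time/hardware cost, and a deadline of 9, only `control` can move to software without violating the deadline.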
In PTOLEMY (Kalavade and Lee, 1992; Kalavade and Lee, 1993), partitioning is performed manually, based on speed, complexity and flexibility requirements. The specification to be implemented in custom hardware and written in SDF is translated to SILAGE (Hilfinger, 1985), a functional language. This provides a link to high-level synthesis systems, which use SILAGE to specify their inputs (Goosens, 1993; Rabaey, 1991).

POLIS (F. Balarin, 1997) provides the designer with an environment to evaluate design decisions through feedback mechanisms, such as formal verification and system co-simulation. As a result of partitioning, we find finite state machine sub-networks chosen for hardware implementation and those chosen for software implementation.

COSYMA (R. Ernst and Benner, 1993) executes hardware/software partitioning on a graph representation of the system by marking nodes to be moved to hardware. It follows an iterative approach, including hardware synthesis, compilation and timing analysis of the resulting hardware/software system. Simulation and profiling identify computation-time-intensive system parts. An estimation of the speedup is obtained through hardware synthesis and the communication penalty for nodes moved to hardware.

The communication between the processor and the application-specific hardware implies additional delays and costs related to the interface circuit and protocol implemented. Once the integration phase has finished, simulation results should be returned to the partitioning stage, so that these additional parameters may be used to obtain a better hardware/software partition, i.e., one that will provide a better performance/cost trade-off.
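To make the role of the communication penalty concrete, the following sketch estimates the overall speedup of a partition from profiled software times. It is an illustration under simple assumptions (sequential execution, one fixed communication penalty per region moved to hardware); the function name and the dictionary-based inputs are hypothetical and are not taken from COSYMA or any other tool mentioned here.

```python
def estimated_speedup(profile, moved, hw_time, comm_penalty):
    """Estimate overall speedup of a hardware/software partition.

    profile:      region name -> software execution time from profiling
    moved:        set of region names implemented in hardware
    hw_time:      region name -> estimated hardware execution time
    comm_penalty: region name -> time spent on hardware/software communication
    """
    t_software_only = sum(profile.values())
    t_partitioned = sum(
        (hw_time[region] + comm_penalty[region]) if region in moved else t
        for region, t in profile.items()
    )
    return t_software_only / t_partitioned
```

With a region that takes 80 time units in software and 10 in hardware, next to 20 units of software that stays put, a penalty of 10 units yields a speedup of 2.5, while a penalty of 70 units cancels the gain entirely. This is why interface costs need to be fed back to the partitioning stage.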
2.4 Hardware Synthesis
Hardware synthesis is employed to generate the logic network for the hardware component. Synthesis is concerned with a reduction in design time and project cost, together with a guarantee of correctness for the circuit. The strategy for the synthesis process, given the behavioral specification of the system and a set of constraints, consists of finding a structure that implements the behavior and, at the same time, satisfies the constraints. A fundamental concept in the design of synthesis tools is that, from an implementation independent system specification, and based on the target technology libraries and user constraints, a number of possible implementations can be obtained (Jay, 1993). With the complexity found in new technologies, it is crucial to employ automatic synthesis tools (G. De Micheli and Truong, 1990), as part of the design process. Automatic synthesis presents some facilities, such as:
shorter design cycle, which means that a product can be completed faster, decreasing the cost in design and the time-to-market;
fewer errors, since the synthesis process can be verified, increasing the chance that the final design will correspond to the initial specification;
the ability to search the design space, since the synthesis system can offer a variety of solutions from the same specification, i.e., a particular solution can be chosen by the designer in order to satisfy tradeoffs between cost, speed or power;
documenting the design process, which means keeping track of the decisions taken and their effects;
availability of integrated circuit technology to more people, as more design expertise is moved into the synthesis system, allowing non-expert people to produce a design.
2.4.1 High-Level Synthesis
High-level synthesis (M. C. McFarland and Camposano, 1990) obtains an optimized register-transfer description from an algorithmic one, which can be written in a high-level programming language or a hardware description language. The register-transfer description consists of an interconnected set of functional components (data path) together with the order in which these components are activated (control path). An example of a high-level synthesis tool is the System Architect’s Workbench (Thomas, 1990).

From the algorithmic description, a Control and Data Flow Graph (CDFG) can be derived, which specifies a set of data manipulation operations, their data dependencies and the order of execution of these operations. This graph can be optimized through the use of transformations similar to those found in a software compiler. Later on, scheduling and allocation are undertaken.

The scheduling process means assigning data operations for execution in particular clock cycles. This allows the minimization of the number of clock cycles needed for the completion of the algorithm, given a set of hardware resources, which involves maximizing the number of operations that can be done in parallel in each cycle. Some scheduling algorithms are considered in (Micheli and Gupta, 1997).

The allocation task (M. C. McFarland and Camposano, 1990) assigns data operations to hardware resources, minimizing the amount of hardware required, in order to meet cost, area and performance constraints. A sub-task is module binding, where known components are taken from a hardware library. If there are no adequate components, then it might be necessary to generate special-purpose hardware, through the use of a logic synthesis tool. For the control path, an implementation must be found using finite state machines with hardwired or micro-programmed control.
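As an illustration of scheduling under a resource constraint, the sketch below assigns the operations of a small data-flow graph to clock cycles using a simple list-scheduling heuristic. It assumes that every operation takes exactly one cycle and that ties are broken alphabetically; the function name and the data layout are hypothetical, and the heuristic is only one of many possible scheduling algorithms, not the one used by any tool named in this chapter.

```python
def list_schedule(ops, deps, units):
    """Assign single-cycle operations to clock cycles.

    ops:   operation name -> kind (e.g. "mul", "add")
    deps:  operation name -> set of predecessor names (data dependencies)
    units: kind -> number of functional units available per cycle
    Returns a dict: operation name -> cycle number (starting at 1).
    """
    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        cycle += 1
        free = dict(units)  # functional units still free in this cycle
        progressed = False
        for name in sorted(ops):
            if name in schedule:
                continue
            # Ready only if every predecessor finished in an earlier cycle.
            ready = all(
                p in schedule and schedule[p] < cycle
                for p in deps.get(name, set())
            )
            if ready and free.get(ops[name], 0) > 0:
                free[ops[name]] -= 1
                schedule[name] = cycle
                progressed = True
        if not progressed:
            raise ValueError("no progress: cyclic dependency or missing unit")
    return schedule
```

For the expression (a*b + c*d) + e, with two multiplications feeding an addition that feeds a second addition, one multiplier gives a four-cycle schedule and two multipliers give three cycles, which shows how the available resources bound the achievable parallelism.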
2.4.2 Implementation Technologies
We can identify three types of logic devices (Coli, 1993): standard logic; programmable ASICs; masked ASICs.

Standard logic devices are not specific to any particular application, being connected together to build an application. Programmable Application Specific Integrated Circuits (ASICs) include simple Programmable Logic Devices (PLDs), complex PLDs and Field Programmable Gate Arrays (FPGAs) (Wolf, 2004). Programmable ASICs are purchased in their un-programmed state and, then, programmed by the user for a specific application. Masked ASICs include gate-arrays and standard cells, being designed by the user for a specific application, tooled by the chosen supplier and, finally, purchased as a custom product.

Circuit density and input/output count generally increase as one moves from standard logic to programmable ASICs and, then, to masked ASICs. Standard logic has the appeal of being purchased as a standard product, but compromises system integration by restricting the designer to interconnecting many “catalogue” circuits. Masked ASICs are the opposite case: they can be easily customized for the user’s specific application, but must be custom tooled and, therefore, purchased as a custom product. Programmable ASICs are a good compromise, because they are purchased as a standard product, but may be used as a custom product.

The configuration of Field Programmable Logic Devices (FPLDs) (York, 1993) is achieved by either EPROM technology, anti-fuse or embedded register. EPROM and embedded registers are erasable, while the anti-fuse approach is one-time programmable. The embedded register configuration technique is commonly referred to as embedded RAM. All of the devices that are based on an embedded register produce the configuration data using shift registers, which is, by nature, a serial process. The time-consuming delays, which arise due to the complexity of silicon processing, are avoided and this helps to facilitate a shorter time-to-market.
It is now possible to implement systems of appreciable complexity on a single field programmable chip. A logic description of the circuit to be implemented in a Field Programmable Gate Array (FPGA) is created and, then, synthesized into a netlist. Synthesis is a critical step in the FPGA design process. For this purpose, there is a variety of input/output formats and tools available. The common goal of all these tools is to simplify the design process, maximize the chance of first-time success and accelerate time-to-market. Unlike masked ASIC tools, which are workstation based, a wide variety of FPGA tools are available on PCs as well as workstations. This can lower the design cost of an FPGA. The high quality and variety of FPGA CAE/CAD tools enhance FPGA design productivity. The availability of these tools on PCs and with open frameworks lowers barriers to use and promotes user-friendly design (Coli, 1993).

Table 2.1. Some characteristics of the XC4000 family of FPGAs

Device                          XC4002A    XC4008     XC4020
Approximate gate count          2,000      8,000      20,000
CLB matrix                      8 × 8      18 × 18    30 × 30
Number of CLBs                  64         324        900
Number of flip-flops            256        936        2,280
Max decode inputs (per side)    24         54         90
Max RAM bits                    2,048      10,368     28,800
Number of IOBs                  64         144        240

FPGAs took advantage of the concept of multiple AND/OR arrays and local connectivity, introduced by complex PLDs (Coli, 1993; Wolf, 2004). An FPGA die offers many small arrays, called logic cells, dispersed around the chip. These logic cells are connected like a gate array using programmable interconnections.

An example of an FPGA is the XC4000 logic cell array family, produced by Xilinx (Xilinx, 1992; Xilinx, 1993), using CMOS SRAM technology. It provides a regular, flexible, programmable architecture of Configurable Logic Blocks (CLBs), interconnected by a powerful hierarchy of versatile routing resources and surrounded by a perimeter of programmable Input/Output Blocks (IOBs). Figure 2.2 shows a CLB and programmable switch matrices, which allow for the connectivity between other CLBs and IOBs. The devices are customized by loading configuration data into the internal memory cells. The FPGA can either actively read its configuration data out of an external serial or byte-parallel PROM (master mode) or the configuration data can be written directly into the FPGA (slave and peripheral modes). Table 2.1 shows the characteristics of some FPGAs from the XC4000 family. The XC4000 family is the first programmable logic device to include on-chip static memory resources. An optional mode for each CLB makes the memory look-up tables usable as either a (16 × 2) or (32 × 1) bit array of read/write memory cells.
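The figures in Table 2.1 can be cross-checked with a little arithmetic. The sketch below assumes that the number of CLBs is the product of the CLB matrix dimensions and that "Max RAM bits" is the number of CLBs multiplied by the 32 bits that each CLB's look-up tables provide in memory mode. The second relation is an inference from the numbers in the table, not a statement taken from the Xilinx data book, and the dictionary layout and function name are hypothetical.

```python
# Values copied from Table 2.1.
XC4000 = {
    "XC4002A": {"matrix": (8, 8), "clbs": 64, "ram_bits": 2048},
    "XC4008": {"matrix": (18, 18), "clbs": 324, "ram_bits": 10368},
    "XC4020": {"matrix": (30, 30), "clbs": 900, "ram_bits": 28800},
}

# A CLB in memory mode is a (16 x 2) or (32 x 1) bit array: 32 bits either way.
BITS_PER_CLB = 16 * 2


def derived_figures(matrix):
    """Return (number of CLBs, maximum RAM bits) implied by the CLB matrix."""
    rows, cols = matrix
    clbs = rows * cols
    return clbs, clbs * BITS_PER_CLB


def table_is_consistent(table):
    """Check every device's tabulated figures against the derived ones."""
    return all(
        derived_figures(d["matrix"]) == (d["clbs"], d["ram_bits"])
        for d in table.values()
    )
```

Under these assumptions all three devices in the table are consistent with the stated memory organization.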
The RAMs are very fast: read access is the same as a logic delay, about 5 ns; write time is about 6 ns; both are several times faster than any off-chip solution. This feature creates new possibilities for the system designer: registered arrays of multiple accumulators, status registers, index registers, DMA counters, distributed shift registers, LIFO stacks and FIFO buffers are all readily and easily implemented.

Figure 2.2. Typical CLB connections to adjacent lines

The Xilinx FPGAs are configured by the XACT design tools (Xilinx, 1994), the Xilinx development system, running on PCs or popular workstations. After schematic- or equation-based entry, the design is automatically converted to a Xilinx Netlist Format (XNF). XACT interfaces to popular design environments like Viewlogic, Mentor Graphics and OrCAD. The XC4000 series devices are used within our co-design environment and the basic steps for their configuration are presented in Section 3.1.3.

Another example of a commonly used FPGA is the Intel FLEXlogic iFX780 family (Intel, 1994), which is fabricated using 0.8 µm CHMOS EPROM technology. These devices are widely used in the target architecture of our co-design system. Each device consists of 8 Configurable Function Blocks (CFBs) linked by a global interconnect matrix. Each CFB can be defined either as a 24V10 logic block (i.e., 24 input/output lines between the block and the global interconnect matrix, and 10 input/output lines between the block and the external pins) or as a block of 128 × 10 SRAM, as described in Figure 2.3. Any combination of signals in the matrix can be routed into any CFB, up to the maximum fan-in of the block (24). This combination will provide approximately 5,000 gates of logic. The SRAM has a minimum read/write cycle of 15 ns, comparable to off-chip commercial ones.

The combination of features available in the iFX780 makes it ideal for a wide variety of applications, such as bus control, custom cache control and DRAM control. The combination of SRAM and logic in a single device becomes a big advantage when designing communication controllers or bus interface controllers, where memory is required for buffering data in addition to the logic for the controller itself. The Intel FLEXlogic FPGA family is supported by industry standard design entry/programming environments, including Intel PLDshell Plus™ software (Intel, 1993), which runs on PCs. Third-party tool support will be provided by vendors like Cadence, Data I/O, Logical Devices, Mentor Graphics, Minc, OrCAD and Viewlogic.
2.4.3 Synthesis Systems
The purpose of a synthesis system is to provide the designer with automatic synthesis tools, which lead from an implementation-independent specification of the system to an implementation configuration based on the target technology. Besides synthesis itself, it usually provides the designer with facilities for validation and simulation of the system specification, before the final configuration is obtained in the form of a gate netlist.

COSMOS (A. A. Jerraya and Ismail, 1993) is an example of a synthesis environment that allows the translation of a system specification written in SDL into VHDL, at the behavioral and register-transfer levels. Another example is AMICAL (I. Park and Jerraya, 1992), an interactive architectural synthesis system based on VHDL, which starts with a behavioral specification in VHDL and generates a structural description that may feed existing silicon compilers acting at the logic and register-transfer levels. The Princeton University Behavioral Synthesis System (PUBSS) (W. Wolf and Manno, 1992) and the Siemens High-Level Synthesis System (CALLAS) (J. Biesenack, 1993; Stoll and Duzy, 1992) are also examples of synthesis systems that use VHDL for the initial specification of a system.

Figure 2.3. Intel FLEXlogic iFX780 configuration

The OLYMPUS synthesis system (G. De Micheli and Truong, 1990) was developed at Stanford University for digital design. It is a vertically integrated set of tools for multilevel synthesis, technology mapping and simulation. The system supports the synthesis of ASICs from behavioral descriptions, written in HardwareC. Internal models represent hardware at different levels of abstraction and provide a way to pass design information among different tools. The OLYMPUS system includes behavioral, structural and logic synthesis tools. Since it is targeted for semi-custom implementations, its output is in terms of gate netlists. Instead of supporting placement and routing tools, OLYMPUS provides an interface to standard physical design tools, such as MISII (Brayton, 1987) for multiple-level logic optimization. Examples of co-design systems using OLYMPUS for hardware synthesis can be found in (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b).

Another example of a synthesis system is ASYL+/Programmable Logic Synthesizer (ASYL+/PLS) (Technologies, 1995), which is dedicated to the FPGA/CPLD user. It maps directly to physical cells in an optimized way. Xilinx designs, for instance, are mapped to CLBs or F-map and H-map primitives. Automatic migration is also implemented, so a design captured using the primitives associated with one FPGA or PLD family, or even an ASIC library, can be targeted to another. ASYL+/PLS comprises the ASYL+/VHDL tool, which accepts all classical constructs of synthesizable VHDL in one of the broadest VHDL support sets available.
2.5
Software Compilation
In PTOLEMY (Kalavade and Lee, 1992; Kalavade and Lee, 1993), the parts of the SDF specification to be implemented as software are sent to a code generation domain, which generates the code for the target processor. Hence,
we can find an SDF domain synthesizing C code and a domain synthesizing assembly code for DSP.
For communication protocols written in PROMELA (A. S. Wenban and Brown, 1992), a software compiler translates the PROMELA program into C++, suitable for embedded controllers. The user may need to supply additional C++ code in order to support any external hardware interfaces.
In (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b), performance-critical regions of a system specification written in C are selected for hardware implementation, using an interactive profiling tool. The identified C source code for a critical region is then adapted to implement a “hardware” call/return mechanism, which invokes the associated operations in the hardware component. A more detailed description will be presented in Section 3.1.2.
In POLIS (F. Balarin, 1997), the finite state machine-like sub-network chosen for software implementation is mapped into a software structure that includes a procedure for each machine. The first step consists of implementing and optimizing the desired behavior in a high-level, processor-independent representation of the decision process, similar to a control/data flow graph. Subsequently, the control/data flow graph is translated into portable C code, which any available compiler can then implement and optimize for a specific, microcontroller-dependent instruction set.
2.6
Interface Synthesis
Interface synthesis involves adding latches, FIFOs or address decoders in hardware and inserting code for I/O operations and semaphore synchronization in software. The hardware interface circuitry can be described in a separate behavioral model. PTOLEMY offers some options, such as inter-processor communication (IPC) or communication between custom hardware and processors. In modeling software, we represent an I/O system call, such as read or write to the hardware device, by an appropriate inter-process communication primitive. In the case of a process described in a hardware description language, where data transfer and synchronization are often represented as explicit port operations, a single inter-process communication primitive represents the set of port operations that perform the data transfer and associated synchronization. In (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b), after partitioning, the resulting hardware description of the critical region, selected for hardware implementation, includes the specification for parameter and control passing. Hence, during hardware synthesis, the hardware function is synthesized together with the data transfer and synchronization controls.
In POLIS (F. Balarin, 1997), the interface between different implementation domains (hardware and software) is automatically synthesized. These interfaces come in the form of cooperating circuits and software procedures (I/O drivers) embedded in the synthesized implementation. Communication can be through I/O ports available on the microcontroller or through general memory-mapped I/O.
2.7
System Integration
Once the hardware synthesis and software compilation phases are complete, the next step is to integrate both hardware and software components. One possibility is to use a prototyping system consisting of a general-purpose processor assisted by application-specific hardware (Ernst and Henkel, 1992; R. Ernst and Benner, 1993; Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993). Another possibility is to use a reconfigurable system for the hardware component and a computer system for the software component. It is also possible to use a framework, such as PTOLEMY (Kalavade and Lee, 1992; Kalavade and Lee, 1993), in order to implement co-simulation and verify the performance of the co-design system. In each of these approaches, the performance results obtained during the system integration phase may be used to change the system partitioning, if the initial constraints, such as speed, cost and logic size, are not satisfied.
In PTOLEMY, the SDF domain supports simulation of algorithms and also allows functional modeling of components, such as filters. The Thor domain implements the Thor simulator, which is a functional simulator for digital hardware and supports the simulation of circuits from the gate level to the behavioral level. Thor, hence, provides PTOLEMY with the ability to simulate digital components ranging in complexity from simple logic gates to programmable DSP devices. The mixed-domain simulation is executed using the components synthesized so far. POLIS (F. Balarin, 1997) uses PTOLEMY as a simulation engine.
Another example of a simulator is POSEIDON, used by (R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994), which performs concurrent execution of multiple functional models implemented either as a program or as application-specific hardware.
Input to POSEIDON consists of gate-level descriptions of the ASIC hardware, assembly code of the software and a description of their interface. The system specification is written using model declarations, model interconnections, communication protocols and system outputs. The interface protocol for data-transfer between models is specified via guarded commands. The gate-level description
Figure 2.4. Target architecture with parameter memory in the coprocessor
of the hardware component is generated using structural synthesis techniques provided by the OLYMPUS synthesis system (G. De Micheli and Truong, 1990).
The first version of the development board presented in (M. D. Edwards, 1993; Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b) comprised a processor, system memory, input/output section, custom hardware and AT bus interface. The custom hardware consisted of a parameter memory, a dual-port controller and the coprocessor, which implemented the synthesized hardware function, as shown in Figure 2.4. The dual-port controller was responsible for the communication between the coprocessor internal bus and the system bus during parameter passing. Parameters were transferred between the coprocessor parameter memory and the system memory as either 16-bit integers, pointers or data arrays, using Direct Memory Access (DMA). In the present version (Edwards and Forrest, 1995; Edwards and Forrest, 1996a), described in more detail in Chapter 3, parameters are transferred directly to the coprocessor as either 16-bit integers or pointers, as can be seen
Figure 2.5. Target architecture with memory-mapped parameter registers
from Figure 2.5. Data arrays are kept in the system memory and only their pointers are passed to the coprocessor. The coprocessor bus interface contains memory-mapped registers for configuring and controlling the operation of the coprocessor.
Another example of a development board is the HARP reconfigurable computer (Hoare and Page, 1994a; Hoare and Page, 1994b; de M. Mourelle, 1998; Page, 1994; Page, 1995; Page and Luk, 1993), a computing platform developed at Oxford University. It consists of a 32-bit RISC microprocessor, with 4MB DRAM, closely coupled with a Xilinx FPGA processing system with its own local memory. The microprocessor can download the hardware configuration into the FPGA via the shared bus.
The target architecture used in (Ernst and Henkel, 1992; R. Ernst and Benner, 1993; Gupta, 1993; R. K. Gupta and Micheli, 1992a; R. K. Gupta and Micheli, 1992b; R. K. Gupta and Micheli, 1994; Gupta and Micheli, 1992; Gupta and Micheli, 1993) consists of a general-purpose processor assisted by application-specific hardware components (ASICs), as shown in Figure 2.6. The hardware modules are connected to the system address and data busses. Thus, all communication between the processor and the other hardware modules takes place over a shared medium. The re-programmable component is always the bus master. Inclusion of such functionality in the application-specific component would greatly increase the total hardware cost. All the communication between the re-programmable component and the ASIC components is achieved over named channels, whose width is the same as the corresponding port widths used by read and write instructions in the software component. The re-programmable component contains a sufficient number of maskable interrupt input signals. These interrupts are un-vectored and there exists a predefined unique destination address associated with each interrupt signal.
Figure 2.6. Target architecture using a general-purpose processor and ASICs
Computer designers can connect the gates of FPGAs into an arbitrary system merely by loading the chip’s internal RAM with configuration data. By combining FPGAs with external RAMs, microprocessors and digital signal processors, designers can create a Reconfigurable System (RS). Plugged into a standard PC, the reconfigurable system would perform functions normally performed by special-purpose cards. Such an arrangement has the following advantages:
faster execution: since FPGAs operate at circuit speeds, they can compute much faster than pure software functions; however, their wiring and logic delays make them slower than equivalent mask-programmed gate arrays or ASICs;
low cost: an RS is much cheaper for each new application than an ASIC; configuring an RS for a new task requires that the PC user reprograms the connections of the logic gates in each FPGA;
low power, small volume: one RS can take over the non-concurrent functions of several dedicated, special-purpose cards, reducing the size and power consumption of the PC system;
increased innovation: because reconfiguring an RS is similar to loading a new software program, the costs of creating a new system will be much lower.
These reconfigurable systems can be used as testbeds for co-design, providing a board for implementing the hardware component, along with the means for communicating with the software component. Examples of testbed
implementations are SPLASH (M. Gokhale, 1991), ANYBOARD (D. E. Van den Bout, 1992), Rasa Board (D. E. Thomas and Schmit, 1993) and the multiple-FPGA-based board presented in (Wendling and Rosenstiel, 1994).
2.8
Summary
In this chapter, we introduced the co-design approach. Then, the co-design methodology was outlined by describing each of the steps involved, together with recent work in this area. In the next chapter, we will discuss the co-design system developed at UMIST, along with experimental results obtained.
Notes
1 ASIC stands for Application Specific Integrated Circuit.
2 SDF stands for Synchronous Data Flow.
3 CPLD stands for Complex PLD.
Chapter 3 THE CO-DESIGN SYSTEM
The purpose of hardware/software co-design is to engineer systems containing an optimum balance of hardware and software components, which work together to achieve a specified behavior and fulfill various design criteria, including meeting performance targets (Wolf, 1994). A general-purpose co-design methodology should allow the designer to progress from an abstract specification of a system to an implementation of the hardware and software sub-systems, whilst exploring tradeoffs between hardware and software in order to meet the design constraints. A co-design environment provides software tool support for some, or all, of the co-design operations and may include an integrated development system consisting of a microcomputer or microcontroller and programmable hardware for system prototyping purposes.
The co-design approach in this book concentrates on the realization of hardware/software systems where the primary objective is to enhance the performance of critical regions of a software application, characterized as a software approach (Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996b). In our case, a critical region is a part of an application where either a software solution cannot meet the required performance constraints, so that a hardware solution must be found, or the overall performance can be usefully accelerated by implementing that region in hardware. In order to produce viable hardware/software systems, we have created a development environment, which supports the co-synthesis and performance evaluation of such systems. An application is first implemented as a C program, and the identified critical regions are synthesized automatically for implementation in a field programmable gate array (FPGA).
3.1
Development Route
The development route consists of a sequence of stages for the translation of a C source program into machine code for a microcontroller and FPGA configuration data, as shown in Figure 3.1, following the co-design methodology presented in Section 2.4. Profiling consists of identifying performance-critical regions in the C source program. Amdahl’s law (Amdahl, 1967) can be employed to determine the performance gain that can be obtained by implementing these critical regions in hardware. It states that “the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used” (Amdahl, 1967). The speedup that can be gained corresponds to the ratio in (3.1).
$$\text{speedup} = \frac{\text{execution time of software-only implementation}}{\text{execution time of software/hardware implementation}} \qquad (3.1)$$
As presented in (Edwards and Forrest, 1996b), the overall speedup of an application Shs can be defined as in (3.2)
$$S_{hs} = \frac{S_{cs}}{1 + \mu\,(S_{cs} - 1)} \qquad (3.2)$$
wherein Scs corresponds to the speedup obtained by executing the critical region in hardware and µ to the fraction of the computation time in the software implementation that is not enhanced by the hardware implementation. As Scs tends to infinity, the overall speedup Shs approaches 1/µ, i.e., it becomes inversely proportional to µ. Therefore, the overall speedup increases as the fraction of the software execution time that is enhanced by the hardware implementation increases. This provides a good design trade-off in selecting potential critical regions for hardware implementation.
Hardware/software partitioning consists of translating performance-critical regions in the C source program into a hardware description for input to a hardware synthesis system. The identified C source code for a region is adapted to implement a “hardware” call/return mechanism, which invokes the associated
Figure 3.1. Development route
operations in the FPGA. The modified C source program is then prepared for compilation. During hardware synthesis, the hardware description of a critical region is synthesized, through a series of synthesis tools, into a netlist for the specific FPGA. The configuration data is then translated into an initialized C array. The software compilation stage accepts the modified C source program and translates it into machine code to be executed by the co-design system. The run-time system consists of the necessary hardware to download the modified C code into the development hardware, execute it and return performance statistics.
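The trade-off expressed by (3.2) can be explored numerically when ranking candidate critical regions. The following is a minimal sketch, not part of the development environment described here; the function names are illustrative assumptions. It evaluates Shs = Scs / (1 + µ(Scs − 1)) and the limiting value 1/µ.

```python
def overall_speedup(s_cs, mu):
    """Overall speedup S_hs = s_cs / (1 + mu * (s_cs - 1)), as in (3.2).

    s_cs: speedup of the critical region when executed in hardware.
    mu:   fraction of the software-only execution time that is NOT
          enhanced by the hardware implementation (0 <= mu <= 1).
    """
    if s_cs <= 0:
        raise ValueError("s_cs must be positive")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return s_cs / (1.0 + mu * (s_cs - 1.0))


def speedup_limit(mu):
    """Limit of the overall speedup as s_cs tends to infinity: 1 / mu."""
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")
    return 1.0 / mu


if __name__ == "__main__":
    # A region sped up 10 times still gives a modest overall gain when
    # half of the execution time stays in software.
    for mu in (0.1, 0.5, 0.9):
        print(mu, overall_speedup(10.0, mu), speedup_limit(mu))
```

With µ = 0.5, for instance, no hardware implementation of the critical region can make the application more than twice as fast, however large Scs becomes.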
3.1.1
Hardware/Software Profiling
Hardware/software profiling permits the identification of performance-critical regions in the C source program. Deterministic, accurate timing analysis has been used to estimate program execution time against bounds in real-time systems (Park and Shaw, 1991; Shaw, 1989). Knowledge about the dynamic behavior of programs has also been used to estimate their performance (J. Gong and Narayan, 1993; Narayan and Gajski, 1992).
In a first approach (Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b), the profiling software is PC-based and critical regions are identified interactively; they can be either single functions or sequences of statements within a function. Performance statistics for a program and selected regions can be readily obtained by compiling and executing the code. Critical regions are identified by running the program with representative data. The performance information is acquired by the run-time system using timers of the development system hardware. The current value of a timer is read both when the profiled region is entered and when it is exited. The profiling system temporarily modifies the original code, so that “time stamps” can be used to indicate entry and exit times from the regions selected by the user. The time stamps are generated by the hardware timers of the system hardware. Figures for the minimum, maximum and average execution times for a selected region, together with the number of times the region is executed, are computed and stored in the system memory of the development hardware. From this information, it is possible to ascertain where a program spends most of its time and, hence, which regions could benefit from software acceleration. At this stage, we naively assume that the execution time of a program is deterministic and does not depend on any asynchronous activities. We are, however, experimenting with a “statistical” form of profiling, where the value of the program counter of the microprocessor is sampled at random intervals. This allows us to determine the dynamic execution characteristics of a program and is useful for profiling interrupt-driven applications. In another approach (Nikkhah, 1997; B. Nikkhah and Forrest, 1996), the selection of candidate functions is performed automatically, based on profiling data.
Functions are the unit of translation, but designers can readily rewrite other regions as functions. A hybrid static/dynamic policy for timing assessment is adopted, where the duration of each “basic block” in a program is estimated statistically for a particular processor. Program regions are compared by their relative timing characteristics. The program is run in a workstation environment and the estimated times of the executed basic blocks are summed up to give an estimated time for the program on the target environment. In a C program, basic blocks are identified by noting the occurrences of “for”, “while”, “do”, “if” and “switch” constructs. Constructs such as “break”, “continue” and “return” may denote the end of these blocks. The basic block execution time is estimated using the source code on an instruction-by-instruction basis. The estimation is independent of the processor on which the source program is executed. The processor-dependent timing data, extracted from the manuals in clock cycles,
is held for each basic operation of a C instruction. Profiling and software timing estimation are done without executing the source program on the target architecture. In (Nikkhah, 1997; B. Nikkhah and Forrest, 1996), after the selection of candidate functions based on the software timing estimation, the area size estimation is performed. This allows the designer to eliminate those candidates that do not satisfy the area constraint.
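The hybrid static/dynamic estimation described above can be illustrated with a small sketch: a static cost in clock cycles is attached to each basic block, and the dynamic execution counts collected on the workstation weight those costs. This is only an illustration under stated assumptions. The cycle table, the operation names and the function names are hypothetical and are not taken from (Nikkhah, 1997); real figures would come from the processor manuals.

```python
# Hypothetical cycle costs per basic operation for some target processor.
CYCLES_PER_OP = {"load": 4, "store": 4, "add": 2, "mul": 28, "branch": 6}


def block_cycles(ops):
    """Static estimate: sum the cycle cost of each operation in a block."""
    return sum(CYCLES_PER_OP[op] for op in ops)


def estimate_program_cycles(blocks, exec_counts):
    """Dynamic part: weight each block by how often it was executed.

    blocks:      dict mapping a block name to its list of operations.
    exec_counts: dict mapping a block name to its execution count,
                 gathered while running the program on the workstation.
    """
    return sum(block_cycles(ops) * exec_counts.get(name, 0)
               for name, ops in blocks.items())


def rank_blocks(blocks, exec_counts):
    """Return block names ordered by estimated total cycles, largest first."""
    totals = {name: block_cycles(ops) * exec_counts.get(name, 0)
              for name, ops in blocks.items()}
    return sorted(totals, key=totals.get, reverse=True)


if __name__ == "__main__":
    blocks = {
        "init": ["load", "store"],
        "loop_body": ["load", "mul", "add", "store", "branch"],
    }
    counts = {"init": 1, "loop_body": 100}
    print(estimate_program_cycles(blocks, counts), rank_blocks(blocks, counts))
```

Because only the cost table depends on the target processor, the same execution counts can be reused to compare several candidate processors.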
3.1.2
Hardware/Software Partitioning
Partitioning (Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996a; Edwards and Forrest, 1996b) consists of translating the C code of a critical region into VHDL, in order to enable high-level hardware synthesis of the identified region. The C source code of the critical region is adapted to implement the parameter passing and a “hardware” call/return mechanism, which invokes the associated operations in the FPGA. The implemented protocol consists of:
parameter passing from the processor to the hardware sub-system;
instigating the function in the FPGA;
waiting until the hardware sub-system completes the operation, which depends on the interface mechanism adopted;
accessing parameters returned by the hardware sub-system.
The partitioning tool (Wright, 1997) generates a VHDL behavioral description of the critical region, as well as the adapted C source program. Special-purpose hardware/software interface logic is generated automatically to implement the transfer of parameters and control between the two sub-systems, and is embedded in the VHDL description. An example of the application of this tool is provided in Appendix A.
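The four protocol steps listed above can be modeled in software before any hardware exists. The sketch below is a hedged illustration only: the class MockCoprocessor and the function hardware_call are hypothetical names, the completion flag is polled (one of several possible interface mechanisms), and the code generated by the partitioning tool itself is the subject of Appendix A.

```python
class MockCoprocessor:
    """Software stand-in for the hardware sub-system (hypothetical model)."""

    def __init__(self, function):
        self._function = function  # behavior of the synthesized region
        self._params = []          # models the parameter registers
        self._result = None
        self.done = False          # models a completion flag

    def write_param(self, value):
        self._params.append(value)

    def start(self):
        # A real coprocessor runs concurrently; this model finishes at once.
        self._result = self._function(*self._params)
        self.done = True

    def read_result(self):
        return self._result


def hardware_call(coprocessor, *args):
    """Software-side stub following the four protocol steps."""
    for value in args:                # 1. pass parameters to the hardware
        coprocessor.write_param(value)
    coprocessor.start()               # 2. instigate the function
    while not coprocessor.done:       # 3. wait for completion (busy-wait)
        pass
    return coprocessor.read_result()  # 4. access the returned value


if __name__ == "__main__":
    adder = MockCoprocessor(lambda a, b: a + b)
    print(hardware_call(adder, 3, 4))
```

Replacing the busy-wait in step 3 by an interrupt handler would change only the stub, not the calling code, which is why the wait step is described as depending on the interface mechanism adopted.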
3.1.3
Hardware Synthesis
The hardware synthesis tool is UNIX-based. It accepts the VHDL code for a critical region, generated by the partitioning tool, and produces the logic network for a Xilinx FPGA XC40xx. During this phase, different tools are used according to the type of output required, as described in Figure 3.2.
Optimization. The VHDL behavioral description of a critical region may follow a structured programming style, i.e., include procedures and functions. This may yield a very large synthesis space. In order to be able to handle such a large space, high-level synthesis tools impose some constraints on the type of hardware description that can be handled, i.e., the synthesis tool may not support record types, some kinds of “loop” statements (“loop . . . end loop”),
Figure 3.2. Hardware synthesis process
some kinds of “wait” statements (“wait for time expression”), “next” statements, “exit” statements and some kinds of signal mode declarations (inout, buffer).
The optimization is presently done by VOTAN (Siemens, 1994) (high-level VHDL Optimization, Transformation and Analysis), an easily extensible system of tools designed to allow the use of high-level synthesis techniques with an established VHDL design flow. The two main aims of VOTAN are to pre-compile descriptions written for a dedicated RT-synthesis tool and to raise the abstraction level of descriptions that can be synthesized. This tool fully enables a structured programming style by allowing procedures (with “wait” statements), functions and arbitrarily nested loops and branches. Therefore, hierarchical descriptions, which are easier to write and maintain, can
be transformed into VHDL code that is synthesizable by commercial RT-level synthesis tools. VOTAN offers a flexible scheduling technique intended to optimize the control flow in a selected VHDL process. In our case, the VHDL behavioral description of a critical region is presented as one process consisting of nested loops, controlled by “wait” statements, and several procedures. VOTAN accepts this input and produces a finite state machine, eliminating the nested loops and the procedure calls.
Synthesis. The next step in the hardware synthesis process is to synthesize the optimized version of the VHDL description of a critical region, generated by VOTAN. We use ASYL+/PLS (Technologies, 1995) to obtain a netlist description for a Xilinx FPGA XC40xx. This tool does not support “wait” statements, multiple “wait” descriptions, “loop” statements, “next” statements or “exit” statements. These restrictions are avoided by transforming the original VHDL description using VOTAN.
A mapper transforms a technology-independent description into a technology-dependent network of cells for the target technology. The initial specification may be a textual language description, such as VHDL, or a schematic or virtual library netlist, such as Electronic Design Interchange Format (EDIF) (Association, 1989). The mappers are technology specific, since they utilize the unique architectural features of the target technology. A netlist optimizer starts from a mapped netlist and optimizes it to improve the performance of the circuit. In this case, the source and the target technology are identical, which allows the identification and optimization of critical paths according to special constraints. For instance, the most common option minimizes the longest path between two flip-flops or latches throughout the design, thus maximizing the clock frequency. ASYL+ performs a specific mapping on the Xilinx 3000, 4000 and 5200 families of FPGAs.
For FPGAs, the output is either a single XNF¹ file or a set of XNF files, ready for processing by the Xilinx place and route tools. The netlist optimizer is available only for the X4000 series.
Partitioning, placement and routing. Once the Xilinx netlist (XNF) is obtained from the synthesis tool, the next step consists of partitioning the design into logic blocks, then finding a near-optimal placement for each block and finally selecting the interconnect routing. These tasks are performed by the Xilinx development system XACT (Xilinx, 1993), which generates a Logic Cell Array (LCA²) file, from which a bit stream file can be generated.
Linking. The serial bit stream description needed to configure the Xilinx FPGA is obtained from the LCA file generated by the previous partitioning, placement and routing phase. This task is performed by the FROMLCA script,
which generates an initialized C array (ximage.c) to be linked to the adapted C source program during the software compilation process. The configuration itself is part of the program execution and so is performed by the software sub-system, once the adapted C code is loaded into the co-design system.
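The linking step amounts to turning a binary bit stream into C source text. The sketch below illustrates that idea only; it is not the FROMLCA script, whose input format and output layout are not described here. The function name, the array identifier ximage and the companion length constant are assumptions made for the example.

```python
def bitstream_to_c_array(data, name="ximage", per_line=12):
    """Render raw configuration bytes as an initialized C array.

    data:     the configuration bit stream as a bytes object.
    name:     identifier of the generated array (hypothetical choice).
    per_line: number of byte literals emitted per source line.
    """
    if not data:
        raise ValueError("empty bit stream")
    lines = ["const unsigned char %s[%d] = {" % (name, len(data))]
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("    " + ", ".join("0x%02x" % b for b in chunk) + ",")
    lines.append("};")
    lines.append("const unsigned long %s_len = %dUL;" % (name, len(data)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three made-up bytes stand in for a real configuration bit stream.
    print(bitstream_to_c_array(bytes([0xFF, 0x20, 0x01]), per_line=2))
```

Emitting the length alongside the array lets the configuration routine in the adapted C program iterate over the data without a hard-coded size.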
3.1.4
Software Compilation
The software compiler takes the adapted C source program, with the inserted “hardware” call/return and the parameter passing mechanism, and produces the processor machine code. During this phase, the configuration data, in the form of an initialized C array generated during the last phase of the hardware synthesis process, is linked with the adapted C code. The compilation phase generates object code in the Motorola S-records format (src) to be loaded into the target processor memory.
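For readers unfamiliar with the Motorola S-record format, the sketch below encodes a single data record. Each record holds a type, a byte count, an address, the data bytes and a checksum, which is the one's complement of the low byte of the sum of the count, address and data bytes. The function name is an assumption, only the data record types S1, S2 and S3 (16-, 24- and 32-bit addresses) are handled, and which record type the compiler used here is not stated in the text.

```python
def srecord(address, data, addr_bytes=2):
    """Encode one S1/S2/S3 data record as an ASCII string.

    address:    load address of the first data byte.
    data:       payload as a bytes object (must not be empty).
    addr_bytes: 2, 3 or 4, selecting record type S1, S2 or S3.
    """
    if addr_bytes not in (2, 3, 4):
        raise ValueError("addr_bytes must be 2, 3 or 4")
    if not 0 <= address < 1 << (8 * addr_bytes):
        raise ValueError("address does not fit the chosen record type")
    # The count covers the address, the data and the checksum byte.
    count = addr_bytes + len(data) + 1
    if not data or count > 0xFF:
        raise ValueError("record must carry between 1 and %d data bytes"
                         % (0xFF - addr_bytes - 1))
    body = bytes([count]) + address.to_bytes(addr_bytes, "big") + bytes(data)
    checksum = 0xFF - (sum(body) & 0xFF)
    return "S%d%s%02X" % (addr_bytes - 1, body.hex().upper(), checksum)


if __name__ == "__main__":
    print(srecord(0x7AF0, bytes([0x0A, 0x0A, 0x0D]) + bytes(13)))
```

A complete file would also carry a header record and a termination record, which this sketch leaves out.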
3.1.5
Run-Time System
The run-time system (Edwards and Forrest, 1996a) executes the adapted C code and consists of the processor and coprocessor boards, shown in Figure 3.3. The processor board comprises a microcontroller MC68332 (MOTOROLA, 1990b), 2MB DRAM, 64KB SRAM and 2 FLEXlogic FPGAs iFX780 (Intel, 1994), which implement the necessary glue logic. The coprocessor board comprises a Xilinx FPGA X4010 (Xilinx, 1992), which implements the hardware function, and a FLEXlogic FPGA iFX780, which implements the bus interface. The adapted C code, together with the initialized C array containing the Xilinx configuration data, is downloaded into the run-time system memory via the Background Debug Mode (BDM³) port. The JTAG⁴ (Intel, 1993; Intel, 1994) port is used to configure the FLEXlogic FPGAs iFX780. The serial peripheral interface (SPI), time processing unit (TPU) and serial ports support low-level input-output facilities (MOTOROLA, 1990b).
The development process, corresponding to profiling, partitioning, software compilation and hardware synthesis, is UNIX-based. In addition, the FLEXlogic FPGAs iFX780 have to be configured. A system interface (A. Ali, 1995) was built to link the run-time system to the development host, as shown in Figure 3.4. There is a single RS232 serial connection, using a packet-based protocol. Under the control of the workstation, the interface can configure the system FLEXlogic devices and operate the BDM port on the coprocessor board. The serial port from the processor, which is principally used to provide program diagnostics during execution, is routed to the workstation via the interface system.
Figure 3.3. Run-time system
3.2
Target Architecture
The target architecture (Edwards and Forrest, 1995) for our co-design system has the configuration shown in Figure 3.5. The microcontroller, global memory and controllers are located on the processor board, also called the software sub-system, since it is responsible for the execution of the adapted C code. The bus interface and coprocessor are located on the coprocessor board, also called the hardware sub-system, since it is responsible for the execution of the critical region (hardware function). Both sub-systems communicate through the main system bus. This type of shared bus architecture is commonly used in co-design systems (R. Ernst and Benner, 1993; D. E. Thomas and Schmit, 1993; N. S. Woo and Wolf, 1994). Integer and pointer parameters are passed to and from the coprocessor via memory-mapped registers, while data arrays are stored in the shared memory.
Figure 3.4. Run-time system and interfaces
3.2.1
Microcontroller
The MOTOROLA MC68332 (MOTOROLA, 1990a; MOTOROLA, 1990b) consists of a 32-bit microcontroller unit (MCU), combining high-performance data manipulation capabilities with powerful peripheral sub-systems. It contains intelligent peripheral modules such as the time processor unit (TPU), which provides 16 micro-coded channels for performing time-related activities. High-speed serial communications are provided by the queued serial module (QSM), with synchronous and asynchronous protocols available. The modules are connected on-chip via the inter-module bus (IMB). The system clock is generated by an on-chip circuit to run the device at up to 16.78 MHz, from a 32.768 kHz crystal. The MCU architecture supports byte, word and long-word (32-bit) operands, allowing access to 8-bit and 16-bit data ports through the use of asynchronous
Figure 3.5. Target architecture
cycles controlled by the data transfer (size1:0) and the data size acknowledge (dsak1:0) signals. The MCU requires word and long-word operands to be located in memory on word or long-word boundaries. The only type of transfer that can be misaligned is a single-byte transfer to an odd address. For an 8-bit port, multiple bus cycles may be required for an operand transfer due to either misalignment or a port width smaller than the operand size.
The MCU contains an address bus (a23:0), which specifies the address for the transfer, and a data bus (d15:0), which transfers the data. Control signals indicate the beginning of the cycle, address space, size of the transfer and the type of the cycle. The selected device then controls the length of the cycle with the signals used to terminate the cycle. Strobe signals, one for the address bus (as) and another for the data bus (ds), indicate the validity of the address and provide timing information for the data. The bus and control input signals are internally synchronized to the MCU clock, introducing a delay, i.e., the MCU reads an input signal only on the falling edge of the clock and not as soon as the input signal is asserted. The function codes (fc2:0) select user or supervisor, program or data spaces. The area selected by fc2:0 = 111, for example, is classified as the CPU space, which allows the CPU to acquire specific control information not normally associated with read or write bus cycles, such as during an interrupt acknowledge cycle.
There are seven prioritized interrupt request lines available (irk7:1), where irk(7) is the highest priority. The irk6:1 lines are internally maskable interrupts and irk(7) is non-maskable. When a peripheral device signals the MCU that the device requires service and the internally synchronized value on these signals indicates a higher priority than the interrupt mask in the status register (or that a transition has occurred in the case of a level 7 interrupt), the
MCU makes the interrupt a pending interrupt. It takes an interrupt exception for a pending interrupt within one instruction boundary (after processing any other pending exception with a higher priority). When the MCU processes an interrupt exception, it performs an interrupt acknowledge cycle to obtain the number of the vector that contains the starting location of the interrupt service routine. Some interrupting devices have programmable vector registers that contain the interrupt vectors for the routines they call. Other interrupting conditions or devices cannot supply a vector number and use the auto-vector cycle, in which the supplied vector number is derived from the interrupt level of the current interrupt.
The bus design of the MCU provides for a single bus master at a time: either the MCU or an external device. One or more of the external devices on the bus can have the capability to become the bus master. Bus arbitration is the protocol by which an external device becomes the master. The bus controller in the MCU manages the bus arbitration signals so that the MCU has the lowest priority. Systems including several devices that can become bus master require external circuitry to assign priorities to the devices, so that, when two or more external devices attempt to become the bus master at the same time, the one having the highest priority becomes bus master first. The sequence of the protocol steps is:
an external device asserts the bus request signal (br);
the MCU asserts the bus grant signal (bg) to indicate that the bus is available;
the external device asserts the bus grant acknowledge signal (bgack) to indicate that it has assumed bus mastership.
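The arbitration sequence above, together with the external priority circuitry needed when several devices compete, can be summarized in a few lines. This is a simplified behavioral sketch, not a timing-accurate model of the MC68332 bus controller; the function name, the device names and the numeric priorities are illustrative assumptions.

```python
def grant_bus(requests):
    """Pick the next bus master and list the handshake that selects it.

    requests: dict mapping the name of each external device currently
              requesting the bus to its priority (larger wins).
    Returns (master, trace). With no request pending the MCU, which has
    the lowest priority, remains the bus master and the trace is empty.
    """
    if not requests:
        return "MCU", []
    # Models the external circuitry that resolves simultaneous requests.
    winner = max(requests, key=requests.get)
    trace = [
        "%s: assert br" % winner,     # bus request
        "MCU: assert bg",             # bus grant
        "%s: assert bgack" % winner,  # bus grant acknowledge
    ]
    return winner, trace


if __name__ == "__main__":
    master, trace = grant_bus({"dma": 1, "coprocessor": 3})
    print(master)
    for step in trace:
        print(step)
```

In this model the MCU never appears in the request table, which reflects the statement that it has the lowest priority and simply keeps the bus whenever no external device asks for it.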
3.2.2 Global Memory
The global memory consists of 64KB SRAM and 2MB DRAM, the latter being directly accessible by both the microcontroller and the coprocessor. The DRAM is implemented by two 1M×8-bit DRAMs, allowing for byte and word access. The 10 DRAM address lines on the module are time multiplexed at the beginning of a memory cycle by two clocks, row address strobe (ras) and column address strobe (cas), into two separate 10-bit address fields. A ras active transition is followed by a cas active transition for all read or write cycles. The normal read cycle begins with both ras and cas active transitions latching the desired byte location. The write input (w) must be high before the cas active transition, in order to enable the read mode. The ras and cas clocks must remain active for a minimum time to complete the read cycle. Data out is valid as long
as the cas clock is active. When the cas clock becomes inactive, the output will switch to high impedance. The write mode is enabled by the transition of w to low, before cas active transition. The data input is referenced to cas. The ras and cas clocks must stay active for some time after the start of the write operation in order to complete the cycle. Page mode allows for fast successive data operations at all column locations on a selected row. A page mode cycle is initiated by a normal read or write. Once the timing requirements for the first cycle are met, cas becomes inactive, while ras remains low. The second cas active transition, while ras is low, initiates the first page mode cycle. The dynamic RAM design is based on capacitor charge storage for each bit in the array. Thus each bit must be periodically refreshed to maintain the correct bit state. The refresh is accomplished by cycling through the 512 row addresses in sequence within the specified refresh time. All the bits on a row are refreshed simultaneously when the row is addressed. A normal read or write operation to the RAM will refresh all the bytes associated with the particular decoded row. One method for DRAM refresh is by asserting cas active before ras, which activates an internal refresh counter that generates the row address to be refreshed. External address lines are ignored during the automatic refresh cycle.
3.2.3 Controllers
The controllers consist of two INTEL FLEXlogic iFX780 FPGAs, as shown in Figure 3.5. They implement the necessary “glue” functions for bus arbitration, bus monitoring, address multiplexing, DRAM accessing and refresh control. Bus arbitration is required in order to determine the bus master. Each coprocessor (notice that two coprocessor ports are available) is provided with a bus request signal. The bus arbitration control accepts these signals, plus the refresh request generated by the TPU, and asserts the bus request signal (br = 0) to the microcontroller. It then waits until the bus is granted (bg = 0) by the microcontroller. Subsequently, the requesting device is selected, according to the priority scheme adopted. The controller signals to the selected device that the bus is granted and asserts the bus grant acknowledge (bgack) to the microcontroller. The bus monitor control signals bus error (berr) to the MCU when no device responds by asserting dsack1:0 within an appropriate period of time after the MCU asserts signal as. This allows the cycle to terminate and the MCU to enter the exception processing mode for the processing of error conditions. DRAM control is used to signal ras and cas in the appropriate sequence, according to the desired operation. Notice that the DRAM consists of a pair of 1M×8-bit devices. During a byte operation, if the address is “even” the higher-order byte of the data bus is used, otherwise, the lower-order byte of the data
bus is used instead. This means that each 1M×8-bit DRAM must be selected individually, which is achieved by controlling their cas signals separately. Since we are using a 16-bit port, the DRAM control asserts dsack(1) when the transfer is complete, while keeping dsack(0) negated. It also detects a refresh request and generates the control signals needed to implement the cas-before-ras refresh operation. The second controller is responsible for the address multiplexing operation. First, the row address is sent. As soon as the DRAM control indicates that the row address was latched, the address multiplexer sends the column address to the same 10-bit address lines of the DRAM. During a long-word operation, this controller prepares the address of the next location. It also signals to the DRAM control whether the present address is in the same page as the previous one, thus allowing for page mode operation.
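The address multiplexing, byte-lane selection and same-page check described above can be illustrated with a small C sketch. It is a model under assumptions, not the FPGA logic itself: the choice of which address bits form the row and which form the column, and the names `dram_decode`, `same_page` and `dram_addr`, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t row;     /* 10-bit row address, latched by ras          */
    uint16_t col;     /* 10-bit column address, latched by cas       */
    bool upper_lane;  /* true: d15:8 (even address); false: d7:0     */
} dram_addr;

/* Splits a byte offset inside the 2MB DRAM into the two multiplexed
 * 10-bit fields. Assumption: the upper 10 bits of the 20-bit cell
 * index form the row and the lower 10 bits form the column. */
dram_addr dram_decode(uint32_t byte_offset)
{
    assert(byte_offset < (2u << 20));        /* 2MB of DRAM             */
    uint32_t cell = byte_offset >> 1;        /* index into 1M locations */
    dram_addr a;
    a.row = (uint16_t)((cell >> 10) & 0x3FFu);
    a.col = (uint16_t)(cell & 0x3FFu);
    a.upper_lane = (byte_offset & 1u) == 0;  /* even address: high byte */
    return a;
}

/* Page mode is possible when two consecutive accesses share a row. */
bool same_page(dram_addr previous, dram_addr next)
{
    return previous.row == next.row;
}
```

Under these assumptions, two neighbouring words fall in the same page, so the second access could skip the row address phase.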
3.2.4 Bus Interface
The purpose of this interface is to allow data parameters and control to be passed correctly between the two sub-systems. In our case, a C program is executed in software except for a performance-critical region, which is executed in the hardware sub-system. Assuming a single thread of control and that the C program is executed on a conventional processor, the sequence of events in executing the critical region in the hardware sub-system is as follows: parameters are passed from the software to the hardware sub-system; the thread of control is passed from the processor executing the C program to the hardware sub-system; the hardware sub-system executes the specified function; the thread of control is passed back to the processor from the hardware sub-system; the function results, if applicable, are read back by the software sub-system. Recall that the critical region, in the original program, has been replaced by a “hardware” call/return mechanism (see Section 3.1.2). There are two commonly used implementations of this mechanism: busy-wait and interrupt. In the busy-wait mechanism, the processor starts the hardware execution of the function and loops, reading a status flag, which indicates when the function has been completed. In the interrupt mechanism, the processor again starts the hardware-based function, but this time it waits for an interrupt, which indicates completion of the function. The choice of interface mechanism can have a significant effect on the performance speedups obtained and is the subject of detailed analysis in the next chapters.
Figure 3.6. Bus interface control register
The bus interface has an 8-bit memory-mapped control register, shown in Figure 3.6, through which the processor directs the operation of the coprocessor. Signals Nprogram, rdy, done and Ninit are related to the Xilinx FPGA configuration (Xilinx, 1993). Signals copro-st, copro-dn and iack-di are related to the Xilinx FPGA operation, when executing the hardware function:
    copro-st: starts the hardware-based function execution;
    copro-dn: indicates that the hardware-based function has finished;
    iack-di: during interrupt, corresponds to the interrupt acknowledge; during busy-wait, disables interrupts from the coprocessor.
Busy-wait. The microcontroller implements the following protocol, expressed in C:
    start();                  /* set copro-st = 1 */
    while (!is_finished())    /* loop until copro-dn = 1 */
        ;
    ack_stop();               /* set copro-st = 0 */
This is a straightforward implementation of the busy-wait protocol. It is assumed that the parameters are transferred before the hardware function is started and the results after it has finished.
Interrupt. The coprocessor uses interrupt line level 3 (irq3) of the microcontroller (see Section 3.2.1) to generate an interrupt once the hardware function has been executed. Automatic vectoring is used, whereby the microcontroller generates the appropriate interrupt vector internally. In the current setup, the microcontroller stops after starting the coprocessor and will resume operation only after an interrupt. Note that other interrupts can also be present in the
system and any of these can restart the processor. The coprocessor completion is identified in the same way as it was for the busy-wait mechanism. Nevertheless, the control register is read only after an interrupt, thus significantly reducing bus contention. In order to withdraw the interrupt request, the coprocessor needs an interrupt acknowledge signal. The MCU does not provide interrupt acknowledge lines. Since a hardware solution for the interrupt acknowledge signal could not be implemented in our current system, due to an area constraint, a software solution was adopted using the iack-di flag in the control register. The interrupt service routine sets this flag, which causes the coprocessor to negate the interrupt request. This generates a small performance overhead when using interrupts. The microcontroller now implements the following protocol, which includes resetting iack-di at the same time as negating copro-st:
    start();           /* sets copro_st = 1 */
    wait_for_int();    /* stops the CPU */
    ack_stop();        /* sets copro_st = 0 and iack_di = 0 */
Parameter passing is achieved via memory-mapped registers located in the coprocessor. The bus interface identifies the beginning of the transfer and signals to the microcontroller when the coprocessor has finished, by asserting the data size acknowledge signal (dsack1 = 0), following the bus transfer protocol (see Section 3.2.1). Moreover, the bus interface controls the coprocessor memory accesses. Once a memory request is detected, it signals to the bus arbiter (see Section 3.2.3), requesting the bus. As soon as the bus is granted, the necessary signals for a bus transfer are driven (see Section 3.2.1). The interrupt controller is also implemented by the bus interface, which detects the end of the coprocessor operation and asserts the corresponding interrupt request signal. It then waits for the interrupt acknowledge signal, as part of the handshake protocol. Notice that the coprocessor has direct access to the data and address buses for parameter passing only. All the control necessary for this operation and for the coprocessor memory accesses is generated by the bus interface itself.
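To make the busy-wait handshake on the control register concrete, here is a minimal C sketch that runs against a simulated register. The bit positions are hypothetical (Figure 3.6 defines the real layout), and `sim_tick` is only a stand-in for the coprocessor, assumed here to finish after three polls; on the real system the flag would change without software help.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions; the actual layout is given in Figure 3.6. */
#define COPRO_ST (1u << 0)   /* start the hardware function   */
#define COPRO_DN (1u << 1)   /* hardware function has finished */

void start(volatile uint8_t *ctrl)       { *ctrl |= COPRO_ST; }
int  is_finished(volatile uint8_t *ctrl) { return (*ctrl & COPRO_DN) != 0; }
void ack_stop(volatile uint8_t *ctrl)    { *ctrl &= (uint8_t)~COPRO_ST; }

/* Simulation stand-in for the coprocessor: once started, it reports
 * completion on the third poll. */
static void sim_tick(volatile uint8_t *ctrl)
{
    static int remaining = 3;
    if ((*ctrl & COPRO_ST) && --remaining == 0) {
        *ctrl |= COPRO_DN;
    }
}

/* Busy-wait call/return: returns how many times the register was polled,
 * which is a rough measure of the bus traffic caused by the loop. */
unsigned busy_wait_call(volatile uint8_t *ctrl)
{
    assert(ctrl != NULL);
    unsigned polls = 0;
    start(ctrl);
    while (!is_finished(ctrl)) {
        sim_tick(ctrl);   /* real hardware advances on its own */
        ++polls;
    }
    ack_stop(ctrl);
    return polls;
}
```

Every iteration of the loop is a bus access to the control register, which is why the number of polls matters when the coprocessor also needs the bus.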
3.2.5 The Coprocessor
Basically, the coprocessor consists of a Xilinx FPGA XC4010, implementing the critical region of the C source program. The coprocessor runs at half the speed of the microcontroller, due to the Xilinx internal delays. We can divide the FPGA into three components, interconnected by internal signals: the clock divider, data buffers and accelerator. These components were merged into a single configuration file (.lca) by the Xilinx tool XACT, during the partitioning, placement and routing stage of the hardware synthesis (see Section 3.1.3). The data buffers permit the interface between the 16-bit main system bus and the coprocessor 32-bit internal data bus. The bus interface provides the necessary control signals to select the appropriate part of the data buffers, according to the size of the data transfer (8-bit, 16-bit or 32-bit), and to control the data flow, according to the type of the operation. Due to the FPGA internal delays, it might be impossible to run an application at the same speed as the microcontroller (16.78MHz). In this case, the coprocessor will require its own clock. The clock divider provides the coprocessor internal clock, based on the main system clock. It is possible to generate a clock frequency of 8.39MHz or 4.19MHz. The accelerator implements the hardware function, selected during the partitioning stage (see Section 3.1.2), whose configuration is obtained during hardware synthesis (see Section 3.1.3). It contains the memory-mapped registers for parameter passing and the body of the hardware function to be executed.
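The role of the data buffers can be illustrated by showing how a 32-bit parameter crosses the 16-bit system bus. The sketch below is an assumption-laden illustration: the function names are hypothetical, operands are taken to be aligned, and the high-order word is sent first, following the big-endian ordering of the 68000 family.

```c
#include <assert.h>
#include <stdint.h>

/* Bus cycles needed for an aligned operand on the 16-bit system bus. */
unsigned bus_cycles_16bit(unsigned operand_bits)
{
    assert(operand_bits == 8 || operand_bits == 16 || operand_bits == 32);
    return operand_bits == 32 ? 2u : 1u;
}

/* Splits a long word into the two 16-bit transfers seen on the bus;
 * out[0] is the high-order word, which sits at the lower address. */
void split_long_word(uint32_t value, uint16_t out[2])
{
    out[0] = (uint16_t)(value >> 16);
    out[1] = (uint16_t)(value & 0xFFFFu);
}

/* Reassembles the long word on the 32-bit side of the data buffers. */
uint32_t join_long_word(const uint16_t in[2])
{
    return ((uint32_t)in[0] << 16) | in[1];
}
```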
3.2.6 The Timer
The timer is implemented by an INTEL FLEXlogic FPGA iFX780, located in the second coprocessor board (coprocessor port 2). It consists of a memory-mapped counter, controlled by the software through a memory-mapped control register. When a specific part of the code needs to be timed, the counter is started just before entering the code. At the end of the timed region, the counter is then halted and read. The examples provided in the next section are timed by this external timer and its use in the source code can be seen in Appendix A.
3.3 Performance Results
The execution times of two programs PLUM and EGCHECK were measured using the busy-wait and interrupt mechanisms (L. de M. Mourelle and Forrest, 1996). This was done in order to determine if any noticeable differences in their performances could be detected. In order to provide an application environment of realistic embedded systems, a periodic interrupt timer (PIT) was implemented to generate “timing” interrupts at different rates (the internal timer for the MC68332 was used for this purpose). A special-purpose external counter, residing on the system bus,
was used to provide execution time information for the two programs. The counter is based on the 16.78MHz microcontroller clock. The basic system timing parameters for both programs are given below: shared memory access time from the microcontroller: 300ns (5 clock cycles); control register access time from the microcontroller: 240ns (4 clock cycles); shared memory access time from the coprocessor: 1200ns (20 clock cycles); interrupt acknowledge cycle: 10020ns (167 clock cycles).
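The four timing parameters are mutually consistent with a clock period of 60ns per cycle (the exact period of a 16.78MHz clock is about 59.6ns, so the listed values appear to be rounded). The following C sketch reproduces them under that assumption; `CYCLE_NS` and `cycles_to_ns` are illustrative names, not part of the system software.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: one clock cycle is taken as 60 ns, the rounded period
 * that reproduces the access times quoted in the text. */
#define CYCLE_NS 60u

/* Converts a number of MCU clock cycles into nanoseconds. */
unsigned cycles_to_ns(unsigned cycles)
{
    assert(cycles <= UINT_MAX / CYCLE_NS);   /* guard against overflow */
    return cycles * CYCLE_NS;
}
```

Under the same assumption, a shared memory access from the coprocessor (20 cycles) costs four times as much as one from the microcontroller (5 cycles).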
3.3.1 First Benchmark: PLUM Program
In Appendix A, we present the PLUM program and the corresponding adapted C program, generated by the synthesis tool Synth during partitioning. The function inner is synthesized in hardware, performing 800,000 shared memory accesses. With the PIT turned off, the time to execute the program in software is 6836ms. When executing the called function in the coprocessor, the execution time for the program reduces to 2234ms, when using the busy-wait mechanism, and 2319ms, when using the interrupt mechanism. This gives speedups of 3.1 and 2.9, respectively. Intuitively, this program has more coprocessor internal operations than memory accesses and it, therefore, seems more natural to use the busy-wait mechanism. In this case, there is little bus contention due to the infrequent memory accesses (one memory access every 3µs). The system performance degrades when the PIT is turned on, as shown in Table 3.1. The times provided are in milliseconds (ms). The first column gives the period of the interrupt timer and the second the execution time of the pure software version of the program. The software overhead, in the third column, is the increase in the execution time caused by the interrupt timer, since the MCU now has to execute the corresponding interrupt routine too. The subsequent columns are related to the hardware/software implementation of the program. The fourth column provides the execution time when using the busy-wait mechanism. The column Error/bw indicates the error introduced into the program execution time by the busy-wait mechanism, which decreases as the time spent by the MCU in executing the interrupt routine decreases. The column Speedup/bw shows the speedup obtained for the same mechanism. The highest speedup is obtained when the MCU is heavily loaded by the interrupt timer. This indicates that moving some functionality to hardware helped reduce the overall execution time.
Similarly, the last three columns provide the results obtained using the interrupt mechanism.
Table 3.1. Performance results for PLUM

PIT (ms)  Tsw (ms)  overhead (%)  Ths/bw (ms)  Err/bw (%)  Speedup/bw  Ths/int (ms)  Err/int (%)  Speedup/int
0.122     10901     59.5          2441         9.3         4.5         2848          22.8         3.8
0.244     8402      22.9          2441         9.3         3.4         2441          5.3          3.4
0.488     7539      10.3          2441         9.3         3.1         2441          5.3          3.1
0.977     7170      4.9           2278         2.0         3.2         2417          4.2          3.0
1.95      6999      2.4           2250         0.7         3.1         2345          1.1          3.0
3.91      6916      1.2           2242         0.4         3.1         2332          0.6          3.0
7.81      6876      0.6           2235         0.0         3.1         2374          2.4          2.9
15.62     6856      0.3           2235         0.0         3.1         2454          5.8          2.8
31.13     6846      0.1           2236         0.0         3.1         2323          0.0          2.9
∞         6836      0.0           2234         0.0         3.1         2319          0.0          2.9
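The derived columns of Table 3.1 can be recomputed from the measured times. The sketch below is an interpretation rather than the authors' procedure: it assumes that the overhead and error columns are relative increases over the corresponding time with the PIT turned off, which matches the values checked here. Results are returned as scaled integers to avoid floating-point comparisons, and all function names are hypothetical.

```c
#include <assert.h>

/* Returns a / b multiplied by `scale`, rounded to the nearest integer. */
static long scaled_ratio(long a, long b, long scale)
{
    assert(a >= 0 && b > 0 && scale > 0);
    return (a * scale + b / 2) / b;
}

/* Speedup Tsw / Ths in tenths (31 means 3.1). */
long speedup_tenths(long t_sw_ms, long t_hs_ms)
{
    return scaled_ratio(t_sw_ms, t_hs_ms, 10);
}

/* Relative increase of t_ms over t_base_ms in tenths of a percent
 * (595 means 59.5%). Assumed to be the definition behind the overhead
 * and error columns. */
long increase_tenths_of_percent(long t_ms, long t_base_ms)
{
    assert(t_ms >= t_base_ms);
    return scaled_ratio(t_ms - t_base_ms, t_base_ms, 1000);
}
```

For the first row of the table (PIT period 0.122ms), this gives a software overhead of 59.5%, a busy-wait error of 9.3% and a busy-wait speedup of 4.5.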
We can draw some conclusions now. As the frequency of the interrupts generated by the PIT increases, so does the amount of code executed by the MCU, corresponding to the interrupt routine. This leads to an increase in the bus contention. But, since PLUM has infrequent memory accesses, bus contention is infrequent as well. The slow interface between the coprocessor and the shared memory does not affect the execution time as much as the interrupt acknowledge cycle does, during the handshake completion, when the coprocessor finishes executing the function. The error introduced with the busy-wait mechanism is smaller than that introduced with the interrupt mechanism, due more to the handshake completion than to the memory accesses performed during the function execution.
3.3.2 Second Benchmark: EGCHECK Program
The EGCHECK program is presented in Appendix A, together with the corresponding adapted C program. The decode function, implemented in hardware, performs 6.4 million shared memory accesses. With the PIT turned off, the execution time of the program in software is 10917ms. When executing decode in the coprocessor, the execution time for the program reduces to 6193ms, when using the busy-wait mechanism, and 5518ms, when using the interrupt mechanism. This gives speedups of 1.8 and 2.0, respectively. Intuitively, the EGCHECK program has more shared memory accesses than coprocessor internal operations. Therefore, it seems more natural to use the interrupt mechanism. In this case, there is significant bus contention, due to the frequent memory accesses (more than one memory access every 1µs). The system performance degrades when the PIT is turned on, as shown in Table 3.2. Again, these results indicate that the interrupt timer has little effect when its interrupt period is greater than 1ms. Faster interrupts indicate that the software overhead has become unacceptable. The columns of Table 3.2 are the same as those of Table 3.1.

Table 3.2. Performance results for EGCHECK

PIT (ms)  Tsw (ms)  overhead (%)  Ths/bw (ms)  Err/bw (%)  Speedup/bw  Ths/int (ms)  Err/int (%)  Speedup/int
0.122     17385     59.2          6300         1.7         2.8         6303          14.2         2.8
0.244     13411     22.8          6242         0.8         2.2         5966          8.1          2.3
0.488     12037     10.3          6217         0.4         1.9         5733          3.9          2.1
0.977     11450     4.9           6204         0.2         1.9         5627          2.0          2.0
1.95      11177     2.4           6199         0.1         1.8         5571          1.0          2.0
3.91      11046     1.2           6196         0.0         1.8         5544          0.5          2.0
7.81      10981     0.6           6194         0.0         1.8         5531          0.2          2.0
15.62     10949     0.3           6194         0.0         1.8         5528          0.2          2.0
31.13     10933     0.1           6193         0.0         1.8         5521          0.1          2.0
∞         10917     0.0           6193         0.0         1.8         5518          0.0          2.0

We can draw some conclusions here too. Since EGCHECK performs shared memory accesses frequently, as the frequency of interrupts generated by the PIT increases, so does the frequency of bus contention. Bus arbitration with the interrupt mechanism is faster than with the busy-wait mechanism. Therefore, the interrupt mechanism provides a better performance, according to the characteristics of this application in terms of number of memory accesses. On the other hand, the error introduced into the program execution time by the busy-wait is still smaller than that introduced by the interrupt. This indicates that, as the PIT interrupt rate increases, the corresponding increase in the code executed by the MCU interferes more with the execution times with interrupt than with busy-wait. This is because the handshake completion with interrupt introduces a considerable overhead due to the interrupt acknowledge cycle.
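The memory access rates quoted for the two benchmarks can be checked with a short calculation. The C sketch below divides the measured execution time by the number of coprocessor memory accesses; since the execution time also covers the software part of each program, the result is only an upper bound on the mean interval between accesses. The function name is hypothetical.

```c
#include <assert.h>

/* Mean time between coprocessor memory accesses, in nanoseconds,
 * taking the whole program execution time as the observation window. */
long long mean_access_interval_ns(long long exec_ms, long long accesses)
{
    assert(exec_ms >= 0 && accesses > 0);
    return exec_ms * 1000000LL / accesses;   /* ms -> ns, then per access */
}
```

With the PIT turned off and the busy-wait mechanism, PLUM (2234ms, 800,000 accesses) yields about 2.8µs per access and EGCHECK (6193ms, 6.4 million accesses) slightly less than 1µs, in line with the rates stated in the text.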
3.3.3 Results Analysis
Two different communication mechanisms were investigated: busy-wait and interrupt. In the first one, the microcontroller reads a memory-mapped control register continuously, leading to the problem of bus contention. In the second one, the microcontroller waits for an interrupt request of the coprocessor, indicating completion of operation, thus reducing bus contention, but introducing an additional time overhead corresponding to the interrupt acknowledge cycle. It was shown that the busy-wait mechanism is better than the interrupt mechanism when the coprocessor performs more internal operations than memory accesses. On the other hand, the interrupt mechanism is more suitable when there are more coprocessor memory accesses than internal operations.
Considering (3.2) in Section 3.1, we can introduce other factors that contribute to the performance of the system. The speedup obtained by executing the critical region in hardware is defined as in (3.3).

$$S_{cs} = \frac{T_{sw}}{T_{hw}} \qquad (3.3)$$
wherein Tsw is the software execution time of the critical region and Thw is its hardware execution time. The latter can be subdivided into five other components: Tpo, which is the time spent transferring the required parameters from the software to the hardware sub-system; Tco, which is the time spent transferring control from the software to the hardware sub-system; Tex, which is the execution time of the critical region in the hardware sub-system; Tci, which is the time spent transferring control from the hardware sub-system to the software; Tpi, which is the time spent transferring the required parameters from the hardware sub-system to the software. The times Tpo and Tpi are related to the requirements of the critical region and the interface between the software and the hardware sub-systems. Tco and Tci are related to the handshake protocol and the interface between the two sub-systems. Time Tex can, in turn, be subdivided into two elements: Tiops, which is the time the hardware sub-system requires to execute the internal operations; Tios, which is the time the hardware sub-system requires to execute the input/output operations. The examples considered perform only memory accesses as input/output operations. From the results obtained, we can see that the time associated with the coprocessor memory accesses is a key factor in the performance of the co-design system. The characteristics of the hardware function (memory accesses) and those of the target architecture (bus interface, handshake protocol, bus arbitration) play a strong role in the performance of the co-design system. In order to determine the most suitable mechanism that will provide the best system performance, the number of coprocessor memory accesses and the memory access rate need to be analyzed. This is the subject of subsequent chapters.
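The decomposition described in this paragraph can be written out as a worked equation. This is a sketch under one assumption that the text implies but does not state explicitly: the five phases run one after the other, without overlap, so that their times simply add up.

```latex
% Assumption: the phases are strictly sequential, so the times are additive.
\begin{align*}
T_{hw} &= T_{po} + T_{co} + T_{ex} + T_{ci} + T_{pi}, \\
T_{ex} &= T_{iops} + T_{ios}, \\
S_{cs} &= \frac{T_{sw}}{T_{po} + T_{co} + T_{iops} + T_{ios} + T_{ci} + T_{pi}}.
\end{align*}
```

Written this way, the speedup shows directly how the handshake terms (Tco, Tci) and the memory access term (Tios) reduce the benefit of a short Tiops.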
3.4 Summary
In this chapter, we have presented the co-design system developed at UMIST, including the adopted development route and the implemented target architecture. In the development route, we described the different stages a co-design implementation should pass through, from the system specification until its execution on the run-time system. This includes system profiling, partitioning, hardware synthesis and software compilation. The target architecture is based on a shared bus configuration, using the MOTOROLA MC68332 microcontroller. Performance results were obtained for two benchmark programs, using the busy-wait and the interrupt interface mechanisms. One program performs more coprocessor internal operations than memory accesses, whilst the other performs more coprocessor memory accesses than internal operations. The performance results indicated that the speedup depends on the coprocessor memory accesses, together with the chosen interface mechanism. The remainder of this book will include a more systematically generated set of results, which will allow us to choose the most suitable interface mechanism for the application. In order to achieve this aim, we will proceed with our study using a simulation method, which will allow us to examine the system in a variety of ways and in a much shorter time. In the next chapter, we present the VHDL model of the co-design system.
Notes
1 XNF stands for Xilinx Netlist Format.
2 LCA is a Xilinx trademark and a reference to the Xilinx FPGA architecture.
3 The BDM is an alternate CPU32 operating mode, during which normal instruction execution is suspended and special microcode performs debugging functions under external control.
4 JTAG/IEEE 1149.1 is an industry standard interface to support in-circuit reconfiguration and programming.
Chapter 4 VHDL MODEL OF THE CO-DESIGN SYSTEM
From the analysis undertaken in the previous chapter, we concluded that there is a relation between the number of coprocessor memory accesses and the interface mechanism, which determines the performance of the co-design system in terms of program execution times. The busy-wait mechanism yields shorter execution times than the interrupt mechanism when the number of coprocessor memory accesses is small. On the other hand, as the number of coprocessor memory accesses increases, the interrupt mechanism becomes the better option, yielding shorter execution times than busy-wait. Therefore, we decided to concentrate our attention on how to determine the best interface mechanism, based on the memory accesses of an application implemented by the coprocessor. This means analyzing the behavior of the execution times for different numbers and rates of memory accesses. From this point of view, we opted for a simulation procedure, instead of a physical experiment, because it allows us to analyze the system in a variety of ways and in a much shorter time. Besides this, through simulation we are able to inspect several levels of the design, which can help us in studying the behavior of the whole system and identifying potential bottlenecks. This chapter presents a simplified VHDL model of our physical co-design system, which was described and used in the previous chapter to implement our applications. Simulation results are presented and analyzed in the next chapter.
4.1 Modelling with VHDL
The VHSIC Hardware Description Language (VHDL) is an industry standard language used to describe hardware from the abstract to the concrete level. In 1986, VHDL was proposed as an IEEE standard. It went through a number of revisions and changes until it was adopted as the IEEE 1076 standard, in December 1987 (IEEE, 1987). The use of Hardware Description Languages (HDL) in system design has a strong influence on the design process and on the Computer Aided Design (CAD) tools used. An HDL like VHDL enables the designer to describe a system at different levels of abstraction and to describe the different aspects of a design using a single, standardized language. Thus, the starting point of an automated design process has become closer to the level of reasoning of a system designer. The system can be described in terms of states and actions triggered by external or internal events, sequential computations within a single process, communication between processes, or structurally, by introducing existing hardware blocks. Extensive simulation of such a multi-level specification is used to validate that the design does what is wanted. In the design process, we can identify three VHDL-based phases in use: system modeling (specification phase), register-transfer level (RTL) modeling (design phase) and netlist (implementation phase). Typically, the system model will be a VHDL model which represents the algorithm to be performed without any hardware implementation in mind. The purpose is to create a simulation model that can be used as a formal specification of the design and which can be run in a simulator to check its functionality. The system model is then transformed into a register-transfer level design in preparation for synthesis. The transformation is aimed at a particular hardware implementation but, at this stage, at a coarse-grain level. In particular, at this stage of the design process, the timing is specified at the clock cycle level.
Also, the hardware resources to be used in the implementation are specified at the block level. The final stage of the design cycle is to synthesize the RTL design to produce a netlist, which should meet the area constraints and timing requirements of the implementation. We will model our co-design system using the RTL description, since we start from a physically implemented design. All the components are already defined as well as their interfaces. Therefore, the system model is not required at this level of the design.
4.1.1 Design Units and Libraries
Design units are the basic building blocks of VHDL (Perry, 1991; Rushton, 1995). A design unit is indivisible in that it must be completely contained in a single file (design file). A file may contain any number of design units. When a file is analyzed using a VHDL simulator or synthesizer, the file is, in effect, broken up into its individual design units and each design unit is analyzed separately as if it were provided in a separate file. The result of the analysis is another file (library unit), which is inserted into a design library. There are five kinds of design units in VHDL: entity, architecture, package, package body and configuration. The design units are further classified as primary or secondary units. A primary design unit can exist on its own. A secondary design unit cannot exist without its corresponding primary unit. The entity is a primary design unit that defines the interface to a circuit. Its corresponding secondary unit is the architecture that defines the contents of the circuit. There can be many architectures associated with a particular entity. The package is also a primary design unit. A package declares types, subprograms, operations, components and other objects which can then be used in the description of a circuit. The package body is the corresponding secondary design unit that contains the implementations of subprograms and operations declared in this package. The configuration declaration is a primary design unit with no corresponding secondary. It is used to define the way in which a hierarchical design is to be built from a range of subcomponents. In a structural description, we declare components and create instances of components. The binding of each component to its corresponding entity is done through a configuration declaration. There are two special libraries which are implicitly available to all design units and, so, do not need to be named in a library clause.
The first of these is called WORK and refers to the working design library into which the current design units will be placed by the analyzer. The second special library is called STD and contains the packages standard and textio (Rushton, 1995). The package standard contains all of the predefined types and functions and the package textio contains the built-in procedures for performing text I/O, providing enough functionality to read data files, for example.
4.1.2 Entities and Architectures
An entity defines the interface to a circuit and the name of the circuit. It specifies the number, the direction and the type of the ports. An architecture defines the contents of the circuit itself. Entities and architectures therefore exist in pairs. It is possible to have an entity without an architecture, but it is not possible to have an architecture without an entity. An example of an entity is given below for a reset-set flip-flop:
    ENTITY rsff IS
      PORT (set, reset: IN bit;
            q, qb: BUFFER bit);
    END rsff;
The circuit rsff has four ports: two input ports, set and reset, and two output ports, q and qb. Notice that the output ports are declared as BUFFER, instead of OUT, because in this case they need to be read as well as modified. A description for the corresponding architecture, using concurrent statements, is as follows:

    ARCHITECTURE behavior OF rsff IS
    BEGIN
      -- two cross-coupled NAND equations, evaluated concurrently
      q  <= NOT (qb AND set) AFTER 2 ns;
      qb <= NOT (q AND reset) AFTER 2 ns;
    END behavior;

Another way of describing an architecture for the same entity is through sequential statements, as below:

    ARCHITECTURE behavior OF rsff IS
    BEGIN
      PROCESS (set, reset)
      BEGIN
        IF set = '1' AND reset = '0' THEN
          q  <= '0' AFTER 2 ns;
          qb <= '1' AFTER 2 ns;
        ELSIF set = '0' AND reset = '1' THEN
          q  <= '1' AFTER 2 ns;
          qb <= '0' AFTER 2 ns;
        ELSIF set = '0' AND reset = '0' THEN
          q  <= '1' AFTER 2 ns;
          qb <= '1' AFTER 2 ns;
        END IF;  -- set = '1' AND reset = '1': outputs keep their values
      END PROCESS;
    END behavior;

The architecture contains only one statement, called a process statement. All the statements between the keywords PROCESS and END PROCESS are part of the process statement and executed sequentially. The list of signals in parentheses after the keyword PROCESS is called the sensitivity list, which enumerates the signals that will cause the process to be executed.
4.1.3
Hierarchy
The natural form of hierarchy in VHDL is the component. Any entity/architecture pair can be used as a component in a higher level architecture. When using hierarchy, other components can be incorporated into a design more easily. Each subcomponent can be designed and tested before being incorporated into the higher levels of the design. Useful subcomponents can be collected together into reusable libraries so that they can be used elsewhere in the same design and later in other designs. The reset-set flip-flop presented previously can also be described using two NAND gates. The new architecture can be described as:

    ARCHITECTURE structure OF rsff IS
      COMPONENT nand2
        PORT (a, b: IN bit;
              c: OUT bit);
      END COMPONENT;
    BEGIN
      U1: nand2 PORT MAP (set, qb, q);
      U2: nand2 PORT MAP (reset, q, qb);
    END structure;
The component declaration (between ARCHITECTURE and BEGIN) describes the interface to the component. Each component instantiation (U1 and U2) creates an instance of the component in the model. This type of VHDL representation is called a structural model, since it has component instantiations in it. Nevertheless, the name of the architecture is not important and is only used to identify the different possible architecture descriptions for the same entity. The configuration statement maps component instantiations to entities. For the architecture structure above, the configuration statement would have the following description:

    CONFIGURATION rsff_config OF rsff IS
      FOR structure
        FOR U1, U2: nand2
          USE ENTITY WORK.nand2(behavior);
        END FOR;
      END FOR;
    END rsff_config;

The role of the configuration statement is to define exactly which architecture to use for every component instance in the model. The configuration statement shown above is identified as rsff_config for entity rsff, using the architecture structure. The above configuration means: for the two component instances U1 and U2 of type nand2, instantiated in the structure architecture, use entity nand2, with architecture behavior, from the library WORK.

It is not our intention to describe the VHDL language, but only to introduce some of its characteristics. In the following sections, more details will be introduced, as we describe our model.
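The configuration above binds U1 and U2 to an entity nand2 with an architecture named behavior, neither of which is listed in this section. A minimal sketch of what such a pair might look like is given below. The 2 ns delay is an assumption borrowed from the rsff examples, not a value taken from the model.

```vhdl
-- Hypothetical two-input NAND gate matching the nand2 component declaration.
-- The 2 ns propagation delay is assumed, mirroring the rsff examples.
ENTITY nand2 IS
  PORT (a, b: IN bit;
        c: OUT bit);
END nand2;

ARCHITECTURE behavior OF nand2 IS
BEGIN
  c <= NOT (a AND b) AFTER 2 ns;
END behavior;
```

Once a pair like this has been analyzed into the library WORK, the binding indication USE ENTITY WORK.nand2(behavior) has something to refer to.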
4.2
The Main System
The model is based on the co-design system discussed in Chapter 3 and shown in Figure 4.1. Each module is implemented as a separate component, with its entity/architecture pair, which may include other subcomponents. The main system is described as the top-level entity in the design hierarchy and so its architecture consists basically of component declarations, component instantiations and configuration statements. The connections between the components are expressed in terms of the signals declared in the architecture. No external connections are needed and, hence, the entity has no ports. The complete description of the main system model is provided in Appendix B.

Figure 4.1. Main system configuration

The microcontroller module corresponds to a simplified version of the MC68332 (MOTOROLA, 1990a; MOTOROLA, 1990b) providing instruction execution (sequencer), memory read/write control and bus arbitration (bus arbiter) only. It does not implement either instruction fetch or instruction decode, since we are only concerned with the coprocessor memory accesses. The instruction's microcode, corresponding to parameter and control passing, is implemented as a sequence of VHDL commands.

The main controller, address multiplexer and timer modules correspond directly to the ones implemented by the Intel FLEXlogic FPGAs, as described in Section 3.2.3 and Section 3.2.6. The FPGAs configuration was obtained from compilation of their specification written in PALASM 2 (Intel, 1993). For our modeling purpose, the translation into VHDL was quite straightforward. The translation process from PALASM 2 into VHDL is presented in Appendix C. The DRAM 16 module is divided into two other modules, each one modeling a 1M × 8 bits DRAM.

The coprocessor board module is divided into two further modules: the bus interface and the coprocessor. Figure 4.2 shows the interconnections between them and the main system bus. The bus interface is implemented by an Intel FLEXlogic FPGA, whose configuration was described using PALASM 2. Translation into VHDL is identical to that adopted for the address multiplexer, main controller and timer units. The coprocessor is implemented by a Xilinx FPGA XC4010D (Xilinx, 1992; Xilinx, 1993), whose configuration is generated by the synthesis tools, as described in Section 3.1.3. The coprocessor can itself be divided into three modules: clock divider, data buffers and accelerator. The accelerator implements the hardware function, translated from C into VHDL during partitioning (see Section 3.1.2). The clock divider and data buffers were previously expressed in the Xilinx Netlist Format (XNF) and had to be modeled in VHDL from their initial requirements.
Figure 4.2. Coprocessor board components
Figure 4.3. Logic symbol for the microcontroller
Considering that the components specified with PALASM are directly translated into VHDL and that the accelerator is already specified in VHDL, only the microcontroller, DRAM, clock/reset generator, coprocessor clock divider and data buffers will be modeled in VHDL from their requirements. The following sections describe each component. We use logic symbols to show the components' interface signals, which translate into their entity ports. The component functionalities are described by algorithmic state machines, when related to controllers, and by flowcharts otherwise.
4.3
The Microcontroller
The microcontroller component, shown in Figure 4.3, consists of a simplified model of the MC68332. The external signals, shown in the logic symbol, are listed in the entity port declaration. The model comprises a clock/reset generator, sequencer, bus arbiter and memory read/write processes, described in the following sections. In the architecture declarative part, internal signals are declared for inter-process communication. An instruction cycle corresponds to the time required for fetching, decoding and executing an instruction, and can consist of one or more machine cycles.
Each machine cycle consists of a memory read/write or bus cycle. A bus cycle comprises a minimum of 3 clock cycles, in which each half clock cycle corresponds to a state. So, we can see that a bus cycle consists of 6 states, each one with some specific task, to be discussed in Section 4.3.4. As a matter of fact, the microcontroller is always executing an instruction cycle. For our special purpose, we do not model either instruction fetching or instruction decode as explained above. Instruction execution is implemented by the sequencer process as a sequence of VHDL commands, described in Section 4.3.2. The control flow of execution of a memory read/write cycle is implemented by the memory read/write process, described in Section 4.3.4. Control signals are asserted or negated in each state, according to the type of the operation (read or write), the data size (byte, word, long word) and the input/output port size (8 bits or 16 bits), which consists of the data size handled by external device. The microcontroller has the lowest priority in accessing the main system bus. Therefore, whenever a bus operation is required, the bus arbiter, described in Section 4.3.3, checks whether there is another device using the bus (Nbgack = 0) or requiring the bus (Nbr = 0). In either case, the bus arbiter waits until the bus is free and there is no other request pending before allocating the bus to the microcontroller. The clock/reset generator is implemented as a component, as part of the internal configuration of the microcontroller. It is responsible for the generation of the main system clock (clk) and the main system reset (Nreset) signals.
4.3.1
Clock and Reset Generator
This component is responsible for the generation of the main system clock (clk) and the main system reset (Nreset) signals. The constant clk_period has been set to 60 ns, which approximates the period of the 16.78 MHz clock (about 59.6 ns). Figure 4.4 shows the VHDL model for the clock and reset generator.
    ENTITY clock_gen IS
      PORT (clk: OUT bitz;
            Nreset: OUT bitz);
    END clock_gen;

    ARCHITECTURE behavior OF clock_gen IS
    BEGIN
      reset: Nreset <= '0', '1' AFTER 5*clk_period;
      clock: PROCESS
      BEGIN
        LOOP
          clk <= '1'; WAIT FOR clk_period/2;
          clk <= '0'; WAIT FOR clk_period/2;
        END LOOP;
      END PROCESS clock;
    END behavior;

Figure 4.4. VHDL model for the clock and reset generator

4.3.2
Sequencer
The sequencer models only the microcontroller instruction execution, through a sequence of VHDL commands. For our simulation purposes, we are concerned with the execution of the modified C source program, as described in Section 3.1.2. Parameter passing is application-dependent. So, we will describe the “hardware” call/return mechanism execution, since it depends on the interface only.

Figure 4.5 shows the execution of the C source function start(), in which the microcontroller writes h0083 into the memory-mapped 8-bit control register (see Section 3.2.4) of the coprocessor, whose address is h600001. This is equivalent to asserting the coprocessor signals iack_di, copro_st and Nprogram, which correspond to bits 7, 1 and 0, respectively, of the control register. The first one disables interrupts from the coprocessor, which must be the case if using the busy-wait interface; otherwise, bit 7 should be set to 0. The second one starts the coprocessor operation and the third one suspends the configuration of the FPGA, which is not implemented for simulation purposes and should always be kept at 1.

In the first loop, the sequencer waits for the memory read/write process to become ready to accept a new operation (op_done = 0). The sequencer then configures the necessary signals and requests a new operation to be performed by the memory read/write process (perform_op = 1). In the second loop, the sequencer waits for the memory read/write process to complete the operation (op_done = 1). As a matter of fact, op_done and perform_op implement the handshake protocol between the two processes.

The internal signal func_code corresponds to the external function code signal fc2:0, indicating the address space of the current bus cycle (fc = 001). The internal signal data_size corresponds to the external data size signal size1:0, indicating the size of the data being transferred. The internal signals upper-word and lower-word correspond to the higher-order and lower-order words, respectively, of the 32-bit internal data. Notice that the external data bus is 16 bits wide (data15:0). The internal signal addr_loc is configured with the address of the required location, corresponding to the external address bus addr23:0.
Figure 4.5. Writing into the coprocessor control register
Signal rNw_op indicates whether the requested operation is a read (rNw_op = 1) or a write (rNw_op = 0), corresponding to the external signal rNw.

Following the execution of function start(), we can have either the busy-wait loop, if we are using the busy-wait mechanism, or the interrupt procedure, if we are using the interrupt mechanism. The busy-wait loop corresponds to the statement while(!is_finished()), in the C source code, and the interrupt procedure corresponds to the function wait_for_int(), in the C source code, as described in Section 3.2.4.

In the busy-wait loop, shown in Figure 4.6, the microcontroller keeps checking bit 6 of the control register, which indicates whether the coprocessor has finished the operation (lower-word(6) = 1). During the iterations, only signal perform_op has to be asserted to indicate a new reading of the control register, since it must be negated as soon as a read operation is complete, as part of the protocol between the sequencer and the memory read/write processes.

Figure 4.6. The busy-wait model

The interrupt procedure is shown in Figure 4.7, in which the microcontroller waits for an interrupt request (Nirq(0) = 0). When the interrupt request signal is asserted, the microcontroller starts the interrupt acknowledge cycle. In the physical implementation, this operation takes 179 clock cycles, which is modeled by the loop with 170 iterations (see Note 1). The microcontroller does not provide an interrupt acknowledge signal, which had to be generated by software in the physical implementation (see Section 3.2.4). In our model, the interrupt acknowledge signal (iack) is sent through bit 7 of the control register (lower-word(7) = 1). First, the control register is read, so that all the other bits are kept unchanged once the write operation is performed.

The stage following either the busy-wait loop or the interrupt procedure is the execution of the C source code corresponding to the function ack_stop() (see Section 3.2.4), shown in Figure 4.8. This function completes the handshake protocol between the microcontroller and the coprocessor, negating signal Ncopro_st. Therefore, we write h0081 into the control register, resetting bit 1, which controls Ncopro_st.

The sequencer model is implemented as a process, without a sensitivity list and, thus, controlled by “wait” statements on the main system clock (WAIT ON clk). The iterations described previously are implemented by “loop” statements, such as:

    LOOP
      EXIT WHEN op_done = '1';
      WAIT ON clk;
    END LOOP;
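The perform_op/op_done handshake described for the sequencer can be sketched as a small self-contained simulation. The entity name handshake_sketch, the mem_stub process and the ctrl_reg signal are hypothetical stand-ins (the actual memory read/write process is the subject of Section 4.3.4), and lower_word is written with an underscore because VHDL identifiers cannot contain hyphens. The sketch only mimics the write of h0083 performed by start(); it does not model bus states or addresses.

```vhdl
-- Sketch only: all names other than perform_op, op_done, rNw_op and clk
-- are illustrative, and mem_stub is a stand-in for the real process.
ENTITY handshake_sketch IS
END handshake_sketch;

ARCHITECTURE behavior OF handshake_sketch IS
  CONSTANT clk_period: time := 60 ns;
  SIGNAL clk: bit := '0';
  SIGNAL perform_op, op_done, rNw_op: bit := '0';
  SIGNAL lower_word: bit_vector(15 DOWNTO 0);
  SIGNAL ctrl_reg: bit_vector(7 DOWNTO 0);   -- stand-in control register
BEGIN
  clk <= NOT clk AFTER clk_period/2;

  sequencer: PROCESS
  BEGIN
    -- First loop: wait until a new operation can be accepted.
    LOOP
      EXIT WHEN op_done = '0';
      WAIT ON clk;
    END LOOP;
    lower_word <= X"0083";   -- bits 7, 1 and 0 set
    rNw_op <= '0';           -- write operation
    perform_op <= '1';       -- request the operation
    -- Second loop: wait for the operation to complete.
    LOOP
      EXIT WHEN op_done = '1';
      WAIT ON clk;
    END LOOP;
    perform_op <= '0';       -- negate the request once the operation is done
    ASSERT ctrl_reg = X"83"
      REPORT "control register was not written" SEVERITY error;
    WAIT;                    -- nothing else to do in this sketch
  END PROCESS sequencer;

  mem_stub: PROCESS
  BEGIN
    WAIT UNTIL clk = '1' AND perform_op = '1';
    IF rNw_op = '0' THEN
      ctrl_reg <= lower_word(7 DOWNTO 0);
    END IF;
    op_done <= '1';          -- signal completion
    WAIT UNTIL clk = '1' AND perform_op = '0';
    op_done <= '0';          -- ready for the next request
  END PROCESS mem_stub;
END behavior;
```

Because the clock runs freely, a simulation of this sketch should be run for a bounded time, for example a few hundred nanoseconds.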
Figure 4.7. The interrupt routine model
4.3.3
Bus Arbiter
The bus arbiter is responsible for the microcontroller internal bus arbitration, indicating to the memory read/write process whether it can access the main system bus. It checks the bus request (Nbr) and the bus grant acknowledge (Nbgack) signals in order to decide if the bus is free and checks signal Nbus_cycle in order to identify if the memory read/write process has taken control of the main system bus. It controls the external bus grant signal Nbg and the internal signals Ntri_state and Nbus_av.

Figure 4.8. Completing the handshake, by negating Ncopro_st

Basically, the bus arbiter implements the protocol defined in (MOTOROLA, 1990b), as shown in Figure 4.9. It is implemented as a process controlled by the system clock (clk). Notice that the microcontroller has the lowest priority in accessing the main system bus.

Figure 4.9. Algorithmic state machine for the bus arbiter

In the initial state 0 of the state machine of Figure 4.9, the bus arbiter keeps the bus grant signal negated (Nbg = 1) and indicates to the memory read/write process that the main system bus is free (Ntri_state = 1 and Nbus_av = 0). If the bus is requested by an external device (Nbr = 0), is not already granted to another one (Nbgack = 1) and the memory read/write process is not accessing the bus (Nbus_cycle = 1), the arbiter enters state 5 in order to grant the bus. On the other hand, if the arbiter finds out that the bus is being used (Nbgack = 0), but not by the memory read/write process (Nbus_cycle = 1), it enters state 3 in order to signal to the memory read/write process that it has to keep the external signals in the high impedance state, without granting the bus (Nbg = 1).

In state 5, the bus arbiter indicates to the memory read/write process that the bus is not available (Nbus_av = 1) and that it must keep the external signals in the high impedance state (Ntri_state = 0). The bus is then granted to the requesting device. If the bus is not being requested any more (Nbr = 1) and the requesting device has not acknowledged (Nbgack = 1), or if the requesting device has taken the bus (Nbgack = 0), the arbiter enters state 2 in order to withdraw the bus grant signal.

In state 3, the bus arbiter indicates to the memory read/write process that the external signals are not available (Nbus_av = 1) and that they must be kept in the high impedance state (Ntri_state = 0). If the bus is requested, the arbiter enters state 6 in order to grant the bus. Otherwise, if the requesting device releases the bus (Nbgack = 1), the arbiter returns to its initial state 0.
In state 2, the bus arbiter withdraws the bus grant signal (Nbg = 1), while indicating to the memory read/write process that the bus is not available (Nbus_av = 1) and that the external signals must be kept in the high impedance state (Ntri_state = 0). If the bus is not being requested (Nbr = 1) and the requesting device has released the bus (Nbgack = 1), the arbiter returns to the initial state 0. If the bus is either being requested (Nbr = 0) or the requesting device took the bus (Nbgack = 0), the arbiter enters state 3.

The bus arbiter enters state 6 because more than one device requested the bus at the same time and one has been granted the bus. In this case, the bus request signal is still asserted (Nbr = 0). The bus is once more granted (Nbg = 0), allowing any external arbitration circuit to decide which device is the next one to use the bus, while the bus is being used. The external arbitration circuit is required when more than one device is allowed to request the bus, which is the case in our co-design system, since two coprocessors can be connected to the main system bus. This circuit is implemented by the main controller, described in Section 3.2.3 and shown in Figure 4.1. The arbiter indicates to the memory read/write process that the bus is not available (Nbus_av = 1) and that the external signals must be kept in the high impedance state (Ntri_state = 0). If the bus is still being requested (Nbr = 0) and no device has acknowledged (Nbgack = 1), the arbiter returns to state 1 in order to wait for the requesting device to withdraw its request (Nbr = 1) or use the bus (Nbgack = 0). On the other hand, if the bus is not being requested any more (Nbr = 1), the arbiter returns to state 3 in order to withdraw the bus grant signal and then return to the initial state 0. Otherwise, it keeps signaling that the bus is granted (Nbg = 0), until a device acknowledges (Nbgack = 0) or withdraws the bus request (Nbr = 1).
Notice that the bus becomes available to the microcontroller (Nbus_av = 0, Ntri_state = 1) only when the arbiter identifies that the bus is free and there is no other device requesting it. The arbiter is modeled as a state machine, consisting of a “case” statement based on the state variable. Each state is implemented by a “when” statement on the possible values for the state variable. Inside each “when” statement, we have a sequence of signal assignments and conditions, corresponding to the operations related to that state. The state machine transitions from one state to another on the rising edge of the main system clock, using “wait” statements (WAIT UNTIL clk = '1' AND clk'EVENT).
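The “case” plus “wait” structure described above can be sketched as follows. This is a partial illustration written from the prose only, not the model of Appendix B: the entity name arbiter_sketch is hypothetical, bit is used instead of the model's own signal type, and only the outputs of states 0, 3 and 5 and the transitions out of state 0 are shown. The remaining states and transitions are deliberately left out.

```vhdl
-- Partial sketch: only state 0 is complete; states 2 and 6 and the exits
-- from states 3 and 5 are omitted, so the machine stays there once entered.
ENTITY arbiter_sketch IS
  PORT (clk, Nbr, Nbgack, Nbus_cycle: IN bit;
        Nbg, Ntri_state, Nbus_av: OUT bit);
END arbiter_sketch;

ARCHITECTURE behavior OF arbiter_sketch IS
BEGIN
  PROCESS
    VARIABLE state: natural RANGE 0 TO 6 := 0;
  BEGIN
    WAIT UNTIL clk = '1' AND clk'EVENT;   -- advance on the rising edge
    CASE state IS
      WHEN 0 =>
        -- Bus free: grant negated, bus available to the microcontroller.
        Nbg <= '1'; Ntri_state <= '1'; Nbus_av <= '0';
        IF Nbr = '0' AND Nbgack = '1' AND Nbus_cycle = '1' THEN
          state := 5;                     -- go and grant the bus
        ELSIF Nbgack = '0' AND Nbus_cycle = '1' THEN
          state := 3;                     -- bus in use by another device
        END IF;
      WHEN 3 =>
        -- No grant, external signals kept in high impedance.
        Nbg <= '1'; Ntri_state <= '0'; Nbus_av <= '1';
      WHEN 5 =>
        -- Bus granted to the requesting device.
        Nbg <= '0'; Ntri_state <= '0'; Nbus_av <= '1';
      WHEN OTHERS =>
        NULL;                             -- states omitted in this sketch
    END CASE;
  END PROCESS;
END behavior;
```

Keeping the state in a process variable, rather than a signal, means that the new state is visible immediately on the next pass through the “case” statement, which is one common way of writing this style of state machine.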
4.3.4
Memory Read and Write
As discussed in the previous section, the memory read/write process identifies the request of a read/write operation and controls all the necessary steps to execute it. Basically, the memory read/write process implements the microcontroller bus operations, each of which consists of a minimum of 6 states, each one corresponding to a half clock cycle (MOTOROLA, 1990b). It is implemented as a process controlled by the system clock (clk). Figure 4.10 shows the algorithmic state machine for the memory read/write process, in which each state is directly related to one of the microcontroller's states. Since each one corresponds to half a clock cycle, they change state on different transitions of the clock: state 0 is entered on the rising edge of the clock, while state 1 is entered on the falling edge of the clock, and so forth.

The microcontroller bus is used in an asynchronous manner. The external devices can operate at clock frequencies different from that of the microcontroller. Therefore, the bus operation uses the handshake lines Nas (address strobe), Nds (data strobe), Ndsack1:0 (data strobe acknowledge), Nberr (bus error) and Nhalt (suspends the bus activity) to control data transfers.

Besides the interface with the sequencer, the memory read/write process requires information about the availability of the main system bus. This is done through the interface with the bus arbiter, consisting of the signals Ntri_state and Nbus_av. On the other hand, the memory read/write process indicates to the bus arbiter that it has taken control of the main system bus by asserting Nbus_cycle.

The memory read/write process remains in state 0 until a bus operation is requested by the sequencer (perform_op = 1). If the bus is being requested by another device (Nbr = 0) or is not available (Nbus_av = 1), the controller does not proceed to state 1. While in state 0, if the bus arbiter indicates that the main system bus is not available (Nbus_av = 1) and that it must be kept in high impedance (Ntri_state = 0), or if the reset signal is asserted (Nreset = 0), the external signals are set to the high impedance state, i.e. addr <= Z.
Once the bus is available, the memory read/write process concurrently configures addr, fc, rNw and size with the values provided by the sequencer, loads the data in the buffer (datamem) if the operation is a write, indicates to the bus arbiter that it has taken control of the main system bus (Nbus_cycle = 0) and enters state 1.

In state 1, since the address lines have been configured already, the address strobe signal is asserted (Nas = 0). If it is a read operation (rNw_op = 1), the data strobe signal is asserted too (Nds = 0), indicating that the microcontroller is ready to read the data from the data bus. The memory read/write process then enters state 2.

In state 2, if it is a write operation, the external data bus receives the appropriate configuration. In this case, if the data size is byte (size = 01), the higher-order and the lower-order bytes of the main data bus receive the same value from the lower-order byte of the lower-order word of the data buffer (datamem7:0). If the data size is word (size = 10), the main data bus receives the lower-order word of the data buffer (datamem15:0). If the data size is long word (size = 00), the main data bus first receives the higher-order word from the data buffer (datamem31:16). The data size now indicates that a word is to be transferred and the operation proceeds to state 3, with another bus cycle being required (wr_fin = F) as soon as the present one finishes. If it is a read operation, the above steps are skipped and the memory read/write process enters state 3.

Figure 4.10. Algorithmic state machine for memory read/write

In state 3, if it is a write operation, the data strobe signal is then asserted (Nds = 0). If the data strobe acknowledge signals are not asserted (Ndsack = 11), it means that the external device is not ready to complete the operation and a wait state is introduced (see Note 2). This procedure is repeated as long as the data strobe acknowledge is not asserted. When the data strobe acknowledge signal is asserted (Ndsack = 10; see Note 3), the memory read/write process enters state 4.

During a read operation, the microcontroller latches the incoming data at the end of state 4. Since we use the falling edge of the clock to enter state 5, the memory read/write process latches the data only in this state. During a write operation, the microcontroller issues no new control signals in state 4.

In state 5, during a write operation (rNw_op = 0), if it is not a long word transfer (wr_fin = T), the memory read/write process finishes the data transfer (op_done = 1, Nbus_cycle = 1); otherwise (wr_fin = F), the write operation has not finished yet and the memory read/write process returns to state 1.
During a read operation (rNw_op = 1), when using a 16-bit port (Ndsack = 01), the action taken depends on the data size. If the data size is long word (size = 00), the higher-order word of the data buffer (datamem31:16) latches the incoming data, the data size changes to word (size = 10) and the memory read/write process enters state 0a to update the address, instead of state 0 (see Note 4). If the data size is word (size = 10), the lower-order word of the data buffer (datamem15:0) latches the incoming data, the transfer finishes (op_done = 1, Nbus_cycle = 1) and the process returns to the initial state 0. If the data size is byte (size = 01) and the address is even (addr(0) = 0), the lower-order byte of the data buffer (datamem7:0) latches the higher-order byte of the incoming data and the transfer finishes (op_done = 1, Nbus_cycle = 1). If the data size is byte (size = 01) and the address is odd (addr(0) = 1), the lower-order byte of the data buffer latches the lower-order byte of the incoming data and the transfer finishes (op_done = 1, Nbus_cycle = 1).

In any of the cases above, the address strobe (Nas) and the data strobe (Nds) signals are negated, since the present bus cycle has finished. Notice that they are set to high impedance only when the process returns to the initial state 0. This means that the data transfer is complete.

The memory read/write process is modeled as a state machine, consisting of a “case” statement based on the state variable. Each state is implemented by a “when” statement on the possible values for the state variable. Inside each “when” statement, we have a sequence of signal assignments and conditions, corresponding to the operations related to that state. The state machine transitions from one state to another either on the rising edge of the main system clock (WAIT UNTIL clk = '1' AND clk'EVENT) or on the falling edge of the main system clock (WAIT UNTIL clk = '0' AND clk'EVENT). This alternation is related to the transition from one machine state to another, which corresponds to half a clock cycle: the process enters one machine state on the rising edge of the clock and the following state on the falling edge of the clock.

Figure 4.11. Logic symbol for the DRAM 16
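The byte-lane rule for a byte read on a 16-bit port (even address: higher-order byte, odd address: lower-order byte) can be illustrated with a small function. The names byte_lane_sketch and read_byte are hypothetical, bit_vector is used for simplicity, and the test pattern hA55A is arbitrary; none of this is taken from the model itself.

```vhdl
-- Sketch only: selects the byte lane of a 16-bit data bus for a byte read.
ENTITY byte_lane_sketch IS
END byte_lane_sketch;

ARCHITECTURE behavior OF byte_lane_sketch IS
  FUNCTION read_byte (data: bit_vector; addr0: bit) RETURN bit_vector IS
    -- Renumber the argument as 15 DOWNTO 0 whatever its original bounds.
    ALIAS d: bit_vector(15 DOWNTO 0) IS data;
  BEGIN
    IF addr0 = '0' THEN
      RETURN d(15 DOWNTO 8);   -- even address: higher-order byte
    ELSE
      RETURN d(7 DOWNTO 0);    -- odd address: lower-order byte
    END IF;
  END read_byte;
BEGIN
  ASSERT read_byte(X"A55A", '0') = X"A5"
    REPORT "even address should select the higher-order byte" SEVERITY error;
  ASSERT read_byte(X"A55A", '1') = X"5A"
    REPORT "odd address should select the lower-order byte" SEVERITY error;
END behavior;
```

In the model, the selected byte is the value latched into datamem7:0 at the end of the read cycle.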
4.4
The Dynamic Memory: DRAM
The 2MB DRAM, shown in Figure 4.11, is implemented by two 1MB DRAMs. The address lines (addr_mux9:0) are time multiplexed into a row address and a column address. The row address corresponds to the higher-order 10 bits of the address bus and the column address corresponds to the lower-order 10 bits of the address bus. The row address is latched when the row address strobe is asserted (Nras = 0). The column address is latched when the column address strobe is asserted (Ncas = 0). In order to allow for byte access, each DRAM can be selected individually, using its column address strobe. Therefore, the lower-order byte is accessed when Ncas0 = 0 and the higher-order byte is accessed when Ncas1 = 0. On the other hand, during a word access, both column address strobes must be active.

Figure 4.12 shows the algorithmic state machine corresponding to one 1MB DRAM. The memory remains in the initial idle state until either the row address strobe signal or the column address strobe signal is asserted. If the row address strobe is asserted (Nras = 0), the memory latches the row address and enters the state ras_bef_cas. If the column address strobe is asserted (Ncas = 0), it means a refresh cycle is starting: the column address is latched and the memory enters state cas_bef_ras (see Note 5). During the initial idle state, the memory keeps the data lines in the high impedance state.

Figure 4.12. Algorithmic state machine for the DRAM

In state ras_bef_cas, if the row address strobe is negated (Nras = 1), the memory returns to the initial state idle. Otherwise, it remains in this state until the column address strobe is asserted. Once the column address strobe is asserted (Ncas = 0), the column address is latched: if it is a read operation, the data lines receive the contents of the memory location pointed to by the row address (row_addr) and column address (col_addr); if it is a write operation, the memory location pointed to by row_addr and col_addr latches the incoming data. After performing the operation, the memory enters state ras_and_cas.

The memory remains in state ras_and_cas until the column address strobe is negated (Ncas = 1). Once this has taken place, if the row address strobe is still asserted (Nras = 0), the memory returns to state ras_bef_cas. This means that the next memory access is in the same page as the present one. On the other hand, if the row address strobe is negated (Nras = 1), the memory returns to the initial idle state. As soon as the column address strobe is negated (Ncas = 1), the data lines go to the high impedance state.

In state cas_bef_ras, the memory starts a refresh cycle. If the column address strobe is negated (Ncas = 1), the memory returns to the initial state idle.
Figure 4.13. Logic symbol of the coprocessor
Otherwise, it remains in this state until the row address strobe is asserted (Nras = 0), entering the refresh state. The memory remains in the refresh state until the row address and the column address strobes are negated, returning, then, to the initial idle state. The memory refresh is now complete.

The input clocks of the DRAM are the row and column address strobes. The controller is modeled as a state machine, consisting of a “case” statement based on the state variable. Each state is implemented by a “when” statement on the possible values for the state variable (idle, ras_bef_cas, cas_bef_ras, ras_and_cas, refresh). Inside each “when” statement, we have a sequence of signal assignments and conditions, corresponding to the operations related to that state. The state machine transitions from one state to another based on events on the two input clocks (ras, cas). Notice that, since the state machine consists of sequential statements, its implementation is inside a process statement, which in this case has a sensitivity list consisting of the two input clocks.
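The five-state behavior described for one DRAM can be sketched as below. This is an approximation written from the prose, not the model used in the book: the entity name dram_sketch, the generic abits (10 in the text, kept small here so the memory array stays tiny), the read/write input name rNw and the use of std_logic with numeric_std are all assumptions.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

-- Sketch only: names, widths and types are assumptions (see the text above).
ENTITY dram_sketch IS
  GENERIC (abits: positive := 4);
  PORT (Nras, Ncas, rNw: IN std_logic;
        addr_mux: IN std_logic_vector(abits-1 DOWNTO 0);
        data: INOUT std_logic_vector(7 DOWNTO 0));
END dram_sketch;

ARCHITECTURE behavior OF dram_sketch IS
  TYPE state_type IS (idle, ras_bef_cas, cas_bef_ras, ras_and_cas, refresh);
  TYPE mem_type IS ARRAY (0 TO 2**(2*abits)-1) OF std_logic_vector(7 DOWNTO 0);
BEGIN
  PROCESS (Nras, Ncas)            -- the two strobes act as input clocks
    VARIABLE state: state_type := idle;
    VARIABLE mem: mem_type;
    VARIABLE row, col: natural := 0;
  BEGIN
    CASE state IS
      WHEN idle =>
        data <= (OTHERS => 'Z');
        IF Nras = '0' THEN        -- normal access: latch the row address
          row := to_integer(unsigned(addr_mux));
          state := ras_bef_cas;
        ELSIF Ncas = '0' THEN     -- a refresh cycle is starting
          col := to_integer(unsigned(addr_mux));
          state := cas_bef_ras;
        END IF;
      WHEN ras_bef_cas =>
        IF Nras = '1' THEN
          state := idle;
        ELSIF Ncas = '0' THEN     -- latch the column address and access
          col := to_integer(unsigned(addr_mux));
          IF rNw = '1' THEN
            data <= mem(row * 2**abits + col);   -- read
          ELSE
            mem(row * 2**abits + col) := data;   -- write
          END IF;
          state := ras_and_cas;
        END IF;
      WHEN ras_and_cas =>
        IF Ncas = '1' THEN
          data <= (OTHERS => 'Z');
          IF Nras = '0' THEN
            state := ras_bef_cas; -- next access is in the same page
          ELSE
            state := idle;
          END IF;
        END IF;
      WHEN cas_bef_ras =>
        IF Ncas = '1' THEN
          state := idle;
        ELSIF Nras = '0' THEN
          state := refresh;
        END IF;
      WHEN refresh =>
        IF Nras = '1' AND Ncas = '1' THEN
          state := idle;          -- refresh complete
        END IF;
    END CASE;
  END PROCESS;
END behavior;
```

Because the process is sensitive only to Nras and Ncas, a transition is evaluated exactly when one of the strobes changes, which matches the description of the strobes as the memory's input clocks.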
4.5
The Coprocessor
The coprocessor component, shown in Figure 4.13, consists of the clock generator, data buffers, the accelerator and some glue logic. The glue logic includes the microcontroller for input control signals, such as Ncopro_st, based on the internal clock. When using the physical co-design system, the configuration of the coprocessor is given in the Xilinx Netlist Format (XNF) (Xilinx, 1993). For the simulation procedure, we specify the structure of the coprocessor in VHDL. The accelerator version is obtained through translation from C to VHDL of the selected critical regions of an application, as described in Section 3.1.2. The following subsections present the model for the coprocessor clock generator and data buffers.
Figure 4.14. Logic symbol of the coprocessor clock generator
Figure 4.15. Flowchart for the coprocessor clock generator
4.5.1
Clock Generator
In the physical implementation, the coprocessor runs at half the microcontroller's speed. This is due to the Xilinx FPGA internal delays (see Section 3.2.5). Therefore, an internal clock (copro_clk), based on the main system clock, is required. Figure 4.14 shows the logic symbol of the coprocessor clock generator. The frequency of copro_clk is half that of clk, as described in the flowchart of Figure 4.15. The clock generator is implemented as a process, based on “if” statements and having a sensitivity list consisting of the main system clock (clk). This means that the process is executed every time an event on clk takes place.
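A divide-by-two clock of this kind can be sketched with a single process sensitive to clk. The entity name clk_div_sketch and the internal variable q are hypothetical, and toggling on the rising edge is an assumption, since the flowchart of Figure 4.15 is not reproduced here.

```vhdl
-- Sketch only: copro_clk toggles once per rising edge of clk,
-- so its frequency is half that of clk.
ENTITY clk_div_sketch IS
  PORT (clk: IN bit;
        copro_clk: OUT bit);
END clk_div_sketch;

ARCHITECTURE behavior OF clk_div_sketch IS
BEGIN
  PROCESS (clk)                 -- runs on every event on clk
    VARIABLE q: bit := '0';
  BEGIN
    IF clk = '1' THEN           -- act on rising edges only
      q := NOT q;
      copro_clk <= q;
    END IF;
  END PROCESS;
END behavior;
```

With the 60 ns main clock of Section 4.3.1, a divider like this would produce a 120 ns coprocessor clock period.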
Figure 4.16. Logic symbol of the coprocessor data buffers
4.5.2
Coprocessor Data Buffers
Input and output data buffers are required to control the data flow between the main system data bus (data15:0) and the coprocessor internal data buses (datain31:0, dataout31:0). Figure 4.16 shows the logic symbol of the coprocessor data buffers. The data buffers allow for the coprocessor bus transfers. There are two 32-bit buffers: datain, for the incoming data, and dataout, for the outgoing data. Signal den3:0 is used to determine which part of the coprocessor internal data bus is connected to the main system data bus during a bus transfer. Notice that we must also allow for byte transfers. According to the bus operation for byte transfer, data can be placed either in the lower-order byte (odd addresses) or in the higher-order byte (even addresses) of the main system data bus.

Signal FPGA_rddir is generated by the coprocessor to control the data flow. During parameter passing, the microcontroller specifies the type of the operation: if it is a write, signal FPGA_rddir goes low; if it is a read, signal FPGA_rddir goes high. During a coprocessor memory operation, it is the coprocessor which specifies the type of the operation: if it is a write, signal FPGA_rddir goes high; if it is a read, signal FPGA_rddir goes low.

Figure 4.17 shows the flowchart for the data input control, in which each bit of signal den3:0 is used to select the buffers, according to the size of the data transferred. The input control is implemented as a process, which is controlled by signals den and data. Figure 4.18 shows the flowchart for the data output control. Besides signal den3:0, we use signal FPGA_rddir to enable the output buffers during a write operation and disable the output buffers during a read operation (data15:0 = 'Z'). The output control is implemented as a process, with a sensitivity list consisting of dataout, den3:0 and FPGA_rddir, and based on “if” statements.
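The output control can be sketched as a tri-state process with the sensitivity list named in the text. Several details below are assumptions, because the flowchart of Figure 4.18 is not reproduced here: the entity name out_buf_sketch, the use of std_logic, the active-high sense of den, the mapping of each den bit to one byte of dataout, and the reading that FPGA_rddir = '1' means the coprocessor drives the bus.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Sketch only: enable polarity and the den-to-byte mapping are assumed.
ENTITY out_buf_sketch IS
  PORT (dataout: IN std_logic_vector(31 DOWNTO 0);
        den: IN std_logic_vector(3 DOWNTO 0);
        FPGA_rddir: IN std_logic;
        data: OUT std_logic_vector(15 DOWNTO 0));
END out_buf_sketch;

ARCHITECTURE behavior OF out_buf_sketch IS
BEGIN
  PROCESS (dataout, den, FPGA_rddir)
  BEGIN
    data <= (OTHERS => 'Z');    -- default: output buffers disabled
    IF FPGA_rddir = '1' THEN    -- coprocessor drives the main data bus
      IF den(3) = '1' THEN data(15 DOWNTO 8) <= dataout(31 DOWNTO 24); END IF;
      IF den(2) = '1' THEN data(7 DOWNTO 0) <= dataout(23 DOWNTO 16); END IF;
      IF den(1) = '1' THEN data(15 DOWNTO 8) <= dataout(15 DOWNTO 8); END IF;
      IF den(0) = '1' THEN data(7 DOWNTO 0) <= dataout(7 DOWNTO 0); END IF;
    END IF;
  END PROCESS;
END behavior;
```

Assigning the high impedance default first and then overriding individual byte lanes keeps every path through the process well defined, since the last assignment to each element within one execution of the process is the one that takes effect.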
Figure 4.17. Flowchart for the coprocessor input buffer control
4.6 Summary
This chapter presented the VHDL model of the physical co-design system described in Chapter 3. The purpose of this model is to allow the analysis of the co-design system through simulation, as proposed in the previous chapter. The description of the system in VHDL was obtained, for the most part, from its original description in PALASM, whilst for some components the VHDL specification was obtained from their requirement specifications. Although the two implementations preserve the same functionality, the delays naturally present in the physical implementation are not taken into account in the model. The whole system model was validated using the LEAPFROG simulator and the CWAVES wave viewer (Systems, 1995), through which different stimuli were introduced and the appropriate behavior obtained. In the following chapter, the co-design system model will be used for a set of simulation-based performance studies. These simulations aim to identify the relation between the coprocessor memory accesses and the two interface mechanisms. In later chapters, different memory configurations will be introduced and some of the modules defined in this chapter will be modified accordingly.

Figure 4.18. Flowchart for the coprocessor output buffer control
Notes
1 The number of iterations is based on the interrupt acknowledge cycle from the development system.
2 Recall that a state corresponds to half a clock cycle, and so does each wait state.
3 With 16-bit ports, Ndsack must be configured with 01, whilst with 8-bit ports, Ndsack must be configured with 10.
4 Another bus cycle is necessary, but as a continuation of the present one, which means that the main bus is kept under the microcontroller control. This would be lost if we returned to state 0.
5 We use the Ncas-before-Nras method of memory refresh.
Chapter 5 SHARED MEMORY CONFIGURATION
The original co-design system is based on a shared memory architecture, which is commonly used in many co-design environments (R. Ernst and Benner, 1993; D. E. Thomas and Schmit, 1993; N. S. Woo and Wolf, 1994). Two interface mechanisms are provided: busy-wait and interrupt. Integer and pointer parameters are passed to and from the coprocessor via memory-mapped registers, and data arrays are stored in the shared memory. The execution times of two programs, PLUM and EGCHECK, were measured on the co-design system for both interface mechanisms, in order to determine whether any noticeable difference in their performance could be detected (L. de M. Mourelle and Forrest, 1996). PLUM has more coprocessor internal operations than memory accesses, with some bus contention, and shows a better speed-up with the busy-wait mechanism. EGCHECK has more coprocessor shared memory accesses than internal operations, with significant bus contention, and shows a better speed-up with the interrupt mechanism. So, the busy-wait mechanism is better than the interrupt mechanism when the coprocessor performs more internal operations than memory accesses. On the other hand, the interrupt mechanism is more suitable when there are more memory accesses than coprocessor internal operations.

In order to determine the most adequate mechanism, that is, the one that provides the best system performance for a range of applications, the coprocessor memory accesses must be analyzed. A specific example is proposed and a simulation procedure is followed, based on our VHDL model of the co-design system, which enables us to study details of the architecture, such as bus arbitration, handshake protocol and parameter passing.
5.1 Case Study
The C program of Figure 5.1 contains the function example to be synthesised in hardware, whose body is based on one main loop, containing two others. The first internal loop executes an internal operation, consisting of the increment of the local variable temp. In this loop, the number of iterations is controlled by the parameter operations, thus allowing us to control the number of coprocessor internal operations. The second internal loop executes a memory write, consisting of assigning the value of temp to the array table (implemented in the shared memory), at position count. In this loop, the number of iterations is controlled by the parameter accesses, thus allowing us to control the number of memory accesses to the same location. The outermost loop, controlled by the parameter iterations, allows us to generate a sequence of internal operations and memory accesses. Notice that, while accesses controls the number of coprocessor memory accesses to the same location (count does not change, while the second loop is being executed), iterations controls the number of coprocessor memory accesses to different locations (count is directly controlled by iterations). These parameters allow us to execute the same function in several different ways, representing the diversity of possible situations in real applications. The formal parameters and local variables are implemented in the coprocessor, while the array is in the shared memory, pointed to by the parameter table. This follows the call-by-reference procedure used in C. During synthesis, the partitioning tool (see Section 3.1.2) takes the C source program and generates two outputs: the VHDL specification of the function example, shown in Appendix D, and the modified C program, which is shown in Figure 5.2. Comparing the C source program with the modified one, we notice that the declaration part and the body of the function differ. 
The declaration part of the modified C program contains the necessary specifications for passing parameters to the coprocessor. Since parameter passing is done via memory-mapped registers (see Section 3.2), the first assignments define the address of each parameter, according to its size. Pointer inst (0x680003) contains the address of the coprocessor register that holds the identification of the function, since more than one function in the same program can be synthesised. Pointer param0 (0x680004) contains the address of the coprocessor register that holds the address of the first element of the array table. Pointer param1 (0x68000A) contains the address of the coprocessor register that holds the parameter iterations. Pointer param2 (0x68000E) contains the address of the coprocessor register that holds the parameter operations. Pointer param3 (0x680012) contains the address of the coprocessor register that holds the parameter accesses. Once the addresses are assigned, the next step is to pass each parameter accordingly.
#include <stdlib.h>

typedef short int array1 [10000];

void example (array1 table, short int iterations,
              short int operations, short int accesses)
{
    short int count, index;
    short int temp = 0;

    for (count = 0; count < iterations; count++) {
        /* coprocessor internal operations */
        for (index = 0; index < operations; index++)
            temp += 1;
        /* memory writes to the same location of the shared array */
        for (index = 0; index < accesses; index++)
            table [count] = temp;
    }
}

array1 table;
short int iterations = 10;
short int operations = 1;
short int accesses = 1;

int main (void)
{
    example (table, iterations, operations, accesses);
    exit (0);
}

Figure 5.1. C program of example
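To make the role of the three parameters concrete, the following is a minimal sketch, not part of the original program, that keeps the loop structure of Figure 5.1 and simply counts the internal operations and the memory writes. The type example_counts and the function example_counted are hypothetical names introduced here for illustration only.

```c
/* Counters collected during one call (hypothetical helper type). */
typedef struct {
    long internal_ops;   /* executions of "temp += 1" */
    long memory_writes;  /* executions of "table[count] = temp" */
} example_counts;

/* Same loop structure as the function example of Figure 5.1,
 * with counters added for illustration. */
example_counts example_counted(short int *table, short int iterations,
                               short int operations, short int accesses)
{
    example_counts c = {0, 0};
    short int count, index;
    short int temp = 0;

    for (count = 0; count < iterations; count++) {
        for (index = 0; index < operations; index++) {
            temp += 1;             /* internal operation */
            c.internal_ops++;
        }
        for (index = 0; index < accesses; index++) {
            table[count] = temp;   /* write to the same location */
            c.memory_writes++;
        }
    }
    return c;
}
```

Under these assumptions, a call performs iterations × operations internal operations and iterations × accesses memory writes, so the default values of Figure 5.1 (10, 1, 1) give 10 of each.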
typedef short int array1 [10000];

void example (array1 table, short int iterations,
              short int operations, short int accesses)
{
    /* addresses of the coprocessor memory-mapped registers */
    unsigned char *const inst = (unsigned char *) 0x680003;
    short int **const param0 = (short int **) (0x680004 + 4 - sizeof (short int *));
    short int *const param1 = (short int *) (0x680008 + 4 - sizeof (short int));
    short int *const param2 = (short int *) (0x68000C + 4 - sizeof (short int));
    short int *const param3 = (short int *) (0x680010 + 4 - sizeof (short int));

    /* parameter passing */
    *inst = 0;
    *param0 = table;
    *param1 = iterations;
    *param2 = operations;
    *param3 = accesses;

    /* "hardware" call/return mechanism (see Section 3.2.4) */
    start ();
    while (!is_finished ())
        ;
    ack_stop ();
}

array1 table;
short int iterations = 10;
short int operations = 1;
short int accesses = 1;

main ()
{
    example (table, iterations, operations, accesses);
    exit (0);
}

Figure 5.2. Modified C program

The last three commands in the body of the modified function (start (), the waiting loop and ack_stop ()) implement the "hardware" call/return mechanism, discussed in Section 3.2.4. Notice that our example uses the busy-wait mechanism ("while (!is_finished ())"). The VHDL specification of the function example will be used by the VHDL model in order to execute the function. The modified C source program would be compiled for execution by the microcontroller. Since our VHDL model of the microcontroller does not implement instruction fetch, decode and execution, the modified C program of Figure 5.2 will be used as a reference only, to obtain the addresses related to each of the parameters. Parameter passing effectively starts by assigning 0 to inst. For each parameter, a sequence of VHDL commands is used to model the corresponding microcode, as described in Section 4.3.2.

Figure 5.3 shows the part of the microcontroller VHDL model (sequencer) associated with the transfer of the parameter table to the coprocessor. The first loop corresponds to the necessary handshake with the memory read/write process (see Section 4.3.2). Once the next operation is allowed, the appropriate internal signals are driven. Signal func_code indicates that the current bus cycle is to the user address space (see Section 3.2.1). Since we do not pass table itself, but the address of its first element (param0), the size of the operand (data_size) is 32 bits (long word). The microcontroller works internally with 32 bits and the parameter to be passed (0x200000) is then placed in the higher-order (upper_word) and lower-order (lower_word) words of the internal bus. The address of the corresponding coprocessor register (0x680004), obtained from the modified C program, is then assigned to the internal signal addr_loc. Signals perform_op and rNw_op indicate to the memory read/write process (see Section 4.3.4) that a bus operation is required, consisting of a write operation. The following loop is part of the handshake between the sequencer and the memory read/write process, once an operation has been requested. As soon as the operation is complete, signal perform_op is negated. These lines are repeated for each parameter, changing the values of data_size, upper_word, lower_word and addr_loc accordingly.

The "hardware" call/return mechanism in the modified C program is executed by the microcontroller through a sequence of VHDL statements, as described in Section 4.3.2. The VHDL model of the co-design system will be used to systematically generate a set of results, based on different values of the parameters iterations, operations and accesses. Our model of the microcontroller provides a sequencer, bus interface and bus arbitrator, which allows the implementation of read and write operations. Since our aim is to analyze the relation between the coprocessor memory accesses and the execution times, this simplified model is sufficient.
5.2 Timing Characteristics
Before starting our analysis of the execution times, we must identify some of the timing characteristics involved. We consider the microcontroller accesses to the shared memory and to the coprocessor memory-mapped registers, such as the control register, and the coprocessor shared memory accesses. Since we are concerned with the coprocessor memory accesses, the times related to the bus arbitration must be studied. The times associated with the busy-wait and interrupt mechanisms must also be analyzed, during a coprocessor memory access and during handshake completion, in order to determine their influence on the function's execution time.

The MC68332 (MOTOROLA, 1990b) is a 32-bit integrated microcontroller, with a 24-bit address bus and a 16-bit data bus. The microcontroller runs at 16.78MHz, which gives a main system clock cycle of approximately 60ns, and the coprocessor runs at 8.39MHz, due to the Xilinx/FPGA internal delays (Xilinx, 1993). The times provided are based on the main system clock (16.78MHz). Some important times, related to shared memory accesses and memory-mapped register accesses, are given below for the microcontroller and coprocessor:
• microcontroller writes into the control register (8 bits): 240ns (4 clock cycles)
• microcontroller reads from the control register (8 bits): 180ns (3 clock cycles)
• microcontroller shared memory read (16 bits): 240ns (4 clock cycles)
• microcontroller shared memory write (16 bits): 240ns (4 clock cycles)
• coprocessor shared memory read (16 bits): 300ns (5 clock cycles)
• coprocessor shared memory write (16 bits): 300ns (5 clock cycles)

The difference between the duration of the microcontroller read and write operations on the control register is due to the microcontroller bus operation (see Section 4.3.4). We must state here that the times obtained for the coprocessor shared memory accesses do not take into account the bus arbitration that necessarily takes place in this kind of shared bus configuration.

LOOP
  EXIT WHEN op_done = '0';
  WAIT ON clk_mcu;
END LOOP;
func_code  <= "001";
-- byte(01) word(10) 3byte(11) long(00)
data_size  <= "00";
upper_word <= "0000000000100000";  -- 0x0020
lower_word <= "0000000000000000";  -- 0x0000
-- 0x680004 (param0)
addr_loc   <= "011010000000000000000100";
perform_op <= '1';
rNw_op     <= '0';
LOOP
  EXIT WHEN op_done = '1';
  WAIT ON clk_mcu;
END LOOP;
perform_op <= '0';

Figure 5.3. Passing parameter table to the coprocessor
5.2.1 Parameter Passing
Parameter passing is done through memory-mapped registers. The first parameter is a 32-bit pointer to the array (table), which requires two bus accesses. The next parameters are 16-bit integers, requiring only one bus access each. Parameter inst is not one of the function's formal parameters, but is part of the parameter passing process; it consists of 8 bits and takes one bus access too:¹
• microcontroller sends inst (8 bits): 540ns (9 clock cycles)
• microcontroller sends the first parameter table (32 bits): 1200ns (20 clock cycles)
• microcontroller sends the next parameters (16 bits each): 600ns (10 clock cycles each)
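As a worked example of the figures above, the sketch below adds up the parameter passing time for a call with a given number of pointer and integer parameters. The names CLOCK_NS, param_passing_cycles and cycles_to_ns are assumptions introduced here, and the 60ns clock period is the rounded value used in the text.

```c
/* Main system clock period in ns (16.78MHz, rounded as in the text). */
#define CLOCK_NS 60

/* Costs quoted in Section 5.2.1, in main system clock cycles. */
enum {
    CYCLES_INST    = 9,   /* 8-bit function identifier inst */
    CYCLES_POINTER = 20,  /* 32-bit pointer, two bus accesses */
    CYCLES_INTEGER = 10   /* 16-bit integer, one bus access */
};

/* Clock cycles needed to send inst plus the given parameters. */
long param_passing_cycles(int n_pointers, int n_integers)
{
    return CYCLES_INST
         + (long)n_pointers * CYCLES_POINTER
         + (long)n_integers * CYCLES_INTEGER;
}

/* Convert main system clock cycles to nanoseconds. */
long cycles_to_ns(long cycles)
{
    return cycles * CLOCK_NS;
}
```

For the function example (one pointer and three integers), this gives 9 + 20 + 30 = 59 clock cycles, that is, 3540ns.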
5.2.2 Bus Arbitration
In a shared memory architecture, there will always be a need for bus arbitration. When using the busy-wait mechanism, there might be bus contention, since the microcontroller might be in a busy-wait cycle. When using the interrupt mechanism, there is no bus contention, since the microcontroller is halted after starting the coprocessor.² When the coprocessor requires a memory access, it asserts a signal named Nmemreq. Bus arbitration then starts when the bus arbitration controller acknowledges Nmemreq and asserts Nbr, 2 clock cycles later. Three signals are now involved in this operation: Nbr (bus request), Nbg (bus grant) and Nbgack (bus grant acknowledge).

Before analyzing the associated times, we have to understand how the microcontroller deals with input signals. Input signals to the microcontroller are sampled on the falling edge of the clock and a corresponding internal signal is then updated half a clock cycle later, as shown in Figure 5.4 for Nbr. The microcontroller bus arbitration is controlled by the rising edge of the clock, reading the internal state of Nbr on the next clock cycle only. The bus is granted 1 clock cycle later, if the bus is free. As with input signals, the output signal Nbg is asserted only on the next falling edge of the clock. The bus arbitration controller acknowledges this event by asserting Nbgack on the next rising edge of the clock, and keeps it asserted until the coprocessor memory access is finished.³ The coprocessor starts the memory access as soon as Nbg is identified, on the rising edge of the clock, as can be seen from Figure 5.5. This diagram is derived from the VHDL simulation of the function example, using the tool CWAVES, available under the LEAPFROG simulator (Systems, 1995). So, it takes 2.5 clock cycles for the coprocessor to have the bus granted and 0.5 clock cycle to start the memory access. We can, therefore, conclude that the minimum time required for bus arbitration is 3 clock cycles, since we are not considering bus contention.

bus_request: PROCESS (clk_mcu, Nreset_mcu, Nbr)
BEGIN
  IF (Nreset_mcu = '0') THEN
    Nbr_mcu <= '1';
  ELSIF (clk_mcu = '0' AND clk_mcu'EVENT) THEN
    Nbr_mcu <= Nbr;
  END IF;
END PROCESS bus_request;

Figure 5.4. Sampling of the bus request input signal (Nbr)
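The falling-edge sampling modelled by the process of Figure 5.4 can be mimicked in C, which may help readers less familiar with VHDL. This is only an illustrative sketch: the type bus_request_t and the functions bus_request_init and bus_request_step are hypothetical, and signal levels are represented as the integers 0 and 1.

```c
/* State of the sampled bus request, mirroring Nbr_mcu in Figure 5.4. */
typedef struct {
    int prev_clk;  /* clock level seen on the previous call */
    int nbr_mcu;   /* internal copy of Nbr (active low) */
} bus_request_t;

/* Start with the clock low and the internal request negated. */
void bus_request_init(bus_request_t *s)
{
    s->prev_clk = 0;
    s->nbr_mcu = 1;
}

/* Evaluate the process for the current signal levels (0 or 1). */
void bus_request_step(bus_request_t *s, int clk, int nreset, int nbr)
{
    if (nreset == 0) {
        s->nbr_mcu = 1;            /* reset: request negated */
    } else if (s->prev_clk == 1 && clk == 0) {
        s->nbr_mcu = nbr;          /* falling edge: sample Nbr */
    }
    s->prev_clk = clk;             /* remember level for edge detection */
}
```

In this sketch, a request asserted while the clock is high only becomes visible internally after the next falling edge, which is the source of the half-cycle delays discussed above.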
5.2.3 Busy-Wait Mechanism
When using the busy-wait mechanism, we must consider the microcontroller busy-wait cycle, which corresponds to the time interval between two consecutive readings of the control register and takes 5 clock cycles. Since it takes 3 clock cycles for the microcontroller to read the control register, 2 clock cycles are spent in preparing for the next reading. If the microcontroller starts a bus cycle, the coprocessor will have to wait for the present bus cycle to finish (3 clock cycles), plus the minimum arbitration time (3 clock cycles), which gives a total of 6 clock cycles. The worst case is when the coprocessor requests the bus 1 clock cycle before a bus cycle starts, since Nbr will be identified 1 clock cycle later, when the bus is already in use, as shown in Figure 5.6.

When the coprocessor finishes executing the function, it asserts the signal Ncopro_dn, which is read during a busy-wait cycle, as shown in Figure 5.7. The microcontroller latches the data on the falling edge of the clock, during state 4⁴ of a read cycle (MOTOROLA, 1990b). The worst case is when Ncopro_dn is asserted on the next rising edge of the clock (that is, by the end of state 5), requiring another busy-wait cycle, which will take 2 clock cycles to start. When the microcontroller identifies the end of a coprocessor operation, it starts a write cycle in order to negate Ncopro_st, as part of the handshake protocol. Once the write cycle starts, it takes an additional 2 clock cycles to negate this signal. The coprocessor is waiting for Ncopro_st in order to negate Ncopro_dn and complete the handshake. So, it will take 9 clock cycles to complete the handshake in the worst case. The best case is when Ncopro_dn is asserted on the rising edge of the clock, in state 4, during a busy-wait cycle. It will take 1 clock cycle to finish the current bus cycle, 2 clock cycles to start the write cycle for completion of the handshake, and an extra 2 clock cycles to negate Ncopro_st. So, it will take a total of 5 clock cycles to complete the handshake in this case.

The discussion indicates some timing values related to two different situations: the first one when the coprocessor wants to make a memory access and the second one when it finishes its operation. The following summarizes the times obtained:
• busy-wait cycle: 300ns (5 clock cycles);
• bus arbitration cycle: 180ns to 360ns (3 to 6 clock cycles);
• handshake completion: 300ns to 540ns (5 to 9 clock cycles).

Figure 5.5. Bus arbitration without contention

Figure 5.6. Bus arbitration when using busy-wait
5.2.4 Interrupt Mechanism
The interrupt mechanism implies that there is no bus contention, since the microcontroller is halted and there are no other active entities. Nevertheless, there is still the need for bus arbitration, but requiring the minimum time of 3 clock cycles, as shown in Figure 5.5 and discussed in the previous section.
Figure 5.7. Handshake completion when using busy-wait
We can now identify the first difference between the two mechanisms, since in the busy-wait case the bus arbitration could take from 3 to 6 clock cycles. Another difference is the way in which the microcontroller identifies the end of a coprocessor operation. Once Ncopro_dn is asserted, an interrupt request (Nirq0) is sent to the microcontroller 1 clock cycle later, as shown in Figure 5.8. The interrupt acknowledge cycle takes 179 clock cycles, corresponding to the time interval between the interrupt request and the interrupt request acknowledge. It then takes 4 clock cycles for the microcontroller to start the write cycle to complete the handshake, as shown in Figure 5.9, and 2 more clock cycles to negate Ncopro_st. Adding these 6 clock cycles to the interrupt acknowledge cycle time, plus 1 clock cycle for the delay between Ncopro_dn and Nirq0, we have a total of 186 clock cycles to complete the handshake. Summarizing, we have the following timing characteristics for the interrupt mechanism:
• bus arbitration cycle: 180ns (3 clock cycles);
• interrupt acknowledge cycle: 10740ns (179 clock cycles);
• handshake completion: 11160ns (186 clock cycles).

Figure 5.8. End of coprocessor operation with interrupt
5.3 Relating Memory Accesses and Interface Mechanisms
The previous sections presented some basic timing characteristics, related to parameter passing, bus arbitration and handshake completion. When comparing the two interface mechanisms, we identify two major differences. The first one is the bus arbitration cycle that takes place before the coprocessor makes any memory access:

\[ \text{bus arbitration}_{bw} - \text{bus arbitration}_{int} = 0\,\text{ns to } 180\,\text{ns} \tag{5.1} \]

The second difference is the way in which the microcontroller identifies the end of a coprocessor operation and deals with it. This corresponds to the handshake completion:

\[ \text{handshake completion}_{int} - \text{handshake completion}_{bw} = 10620\,\text{ns to } 10860\,\text{ns} \tag{5.2} \]
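The ranges in Equations (5.1) and (5.2) follow directly from the cycle counts given in Sections 5.2.2 to 5.2.4. The sketch below reproduces that arithmetic; the constant and function names are assumptions made for this illustration, and the 60ns clock period is the rounded value used in the text.

```c
#define CLOCK_NS 60  /* main system clock period in ns, as rounded in the text */

/* Cycle counts quoted in Sections 5.2.2 to 5.2.4. */
enum {
    ARB_MIN         = 3,    /* bus arbitration without contention */
    ARB_BW_MAX      = 6,    /* busy-wait: current bus cycle (3) + arbitration (3) */
    HS_BW_MIN       = 5,    /* busy-wait handshake completion, best case */
    HS_BW_MAX       = 9,    /* busy-wait handshake completion, worst case */
    IRQ_DELAY       = 1,    /* from Ncopro_dn to Nirq0 */
    IRQ_ACK         = 179,  /* interrupt acknowledge cycle */
    IRQ_WRITE_START = 4,    /* start of the write cycle */
    NEGATE_ST       = 2     /* negation of Ncopro_st */
};

/* Handshake completion for the interrupt mechanism, in clock cycles. */
int handshake_int_cycles(void)
{
    return IRQ_DELAY + IRQ_ACK + IRQ_WRITE_START + NEGATE_ST;
}

/* Convert main system clock cycles to nanoseconds. */
int cycles_to_ns(int cycles)
{
    return cycles * CLOCK_NS;
}
```

With these values, the interrupt handshake takes 186 clock cycles (11160ns), so the handshake difference ranges from 186 − 9 = 177 clock cycles (10620ns) to 186 − 5 = 181 clock cycles (10860ns), and the arbitration difference ranges from 0 to 3 clock cycles (0ns to 180ns).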
Figure 5.9. Handshake completion when using interrupt
Since our performance evaluation is based on execution times, the bus arbitration can contribute significantly to its increase, depending on the number of memory accesses performed by the coprocessor. Of course, parameter passing has some impact too, but only when the function is called. In order to identify how the memory accesses determine the best interface mechanism (de M. Mourelle and Edwards, 1997), which will provide the shortest execution times, we will use the example of Figure 5.1, for different parameter values. When changing iterations, we control the number of coprocessor internal operations and memory accesses; when changing operations, we control the memory access rate; and when changing accesses, we control the number of successive memory accesses. For convenience, we now provide the times in nanoseconds.
Table 5.1. Performing 10 internal operations and a single memory access per iteration

iterations  $T_{bw}$ (ns)  $T_{int}$ (ns)  $T_{int} - T_{bw}$  $T_{bw}(i) - T_{bw}(i-1)$  $T_{int}(i) - T_{int}(i-1)$
0           5820           16680           10860               –                          –
1           11700          22440           10740               5880                       5760
2           17580          28200           10620               5880                       5760
3           23460          33960           10500               5880                       5760
8           52860          62760           9900                5880                       5760
9           58740          68520           9780                5880                       5760
10          64620          74280           9660                5880                       5760
20          123420         131880          8460                58800                      57600
30          182220         189480          7260                58800                      57600
80          476220         477480          1260                58800                      57600
90          535020         535080          60                  58800                      57600
100         593820         592680          −1140               58800                      57600
200         1181820        1168680         −13140              588000                     576000
300         1769820        1744680         −25140              588000                     576000
5.3.1 Varying Internal Operations and Memory Accesses
When varying iterations, we change the number of coprocessor internal operations and the number of coprocessor memory accesses at the same time. Table 5.1 shows the execution times for busy-wait ($T_{bw}$) and interrupt ($T_{int}$) as a function of iterations, performing 10 coprocessor internal operations (operations = 10) and 1 memory access per iteration (accesses = 1). The initial times $T_{bw}(0)$ and $T_{int}(0)$ correspond to the time necessary to run the coprocessor without executing the function's main loop (see Figure 5.1). This is the time spent in parameter passing, some coprocessor internal operations (to deal with the calculations of the loop controlled by iterations) and the handshake completion. The difference between these two initial times is directly related to the difference in the handshake completion, since the other tasks are independent of the interface mechanism and there is no bus arbitration:⁵

\[ T_{int}(0) - T_{bw}(0) = 10860\,\text{ns} \quad (181 \text{ clock cycles}) \tag{5.3} \]
So, this result corresponds to the biggest difference between the handshake completions for the two interfaces (181 clock cycles). Since the handshake completion time is constant for the interrupt, we can conclude that the example applied provides the shortest handshake completion for busy-wait. There is 1 memory access every 5880ns for the busy-wait and every 5760ns for the interrupt. This time includes the bus arbitration, the execution of 10 coprocessor internal operations, the coprocessor memory write and the coprocessor internal control calculations:

\[ T_{bw}(iterations) = 5820 + (5880 \times iterations) \tag{5.4} \]

\[ T_{int}(iterations) = 16680 + (5760 \times iterations) \tag{5.5} \]
As the number of iterations increases, so does the number of coprocessor memory accesses.⁶ The frequency of bus contention increases in the case of the busy-wait mechanism, but there is no bus contention in the interrupt case. Therefore, the difference between the execution times for busy-wait and for interrupt decreases, as we can see from Figure 5.10. The total number of coprocessor memory accesses for which the execution times of busy-wait and interrupt are the same is:

\[ T_{int} - T_{bw} = 0 \Rightarrow 10860 - (120 \times iterations) = 0 \tag{5.6} \]

\[ T_{int} = T_{bw} \Rightarrow iterations = 90.5 \tag{5.7} \]
Equation (5.6) shows that the difference between the two interface mechanisms depends on the handshake completion and bus arbitration, as stated before. Table 5.2 shows another set of results based on the number of memory accesses (iterations), but performing 100 coprocessor internal operations and 1 memory access per iteration. By changing the number of coprocessor internal operations, we change the memory access rate. We can see that the initial times for both mechanisms are the same as in Table 5.1, for which we had 10 coprocessor internal operations. This is expected, since the main loop of the function is not executed when iterations = 0, which means that any changes in operations or accesses will not have any effect in this case. Equations (5.8) and (5.9) for the execution times can now be formulated:

\[ T_{bw}(iterations) = 5820 + (38280 \times iterations) \tag{5.8} \]

\[ T_{int}(iterations) = 16680 + (38160 \times iterations) \tag{5.9} \]
There is one coprocessor memory access every 38280ns for the busy-wait and every 38160ns for the interrupt. This shows that the memory access rate depends on the number of coprocessor internal operations. The total number of coprocessor memory accesses for which the execution times of busy-wait and interrupt are the same is:

\[ T_{int} - T_{bw} = 0 \Rightarrow 10860 - (120 \times iterations) = 0 \tag{5.10} \]

\[ T_{int} = T_{bw} \Rightarrow iterations = 90.5 \tag{5.11} \]
Equation (5.10) is the same as Equation (5.6). This means that the difference between the two interface mechanisms does not depend on the memory access rate, but on the total number of memory accesses.
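The break-even point of Equations (5.7) and (5.11) can be recomputed from the linear models of Equations (5.4), (5.5), (5.8) and (5.9). The following sketch assumes a simple two-coefficient representation; linear_model, model_time and break_even are names introduced here for illustration.

```c
/* Linear execution-time model T(n) = t0 + slope * n, in nanoseconds. */
typedef struct {
    double t0;     /* execution time for n = 0 */
    double slope;  /* time added by each iteration */
} linear_model;

/* Execution time predicted by the model for n iterations. */
double model_time(linear_model m, double n)
{
    return m.t0 + m.slope * n;
}

/* Number of iterations at which both models predict the same time.
 * Returns -1.0 when the slopes are equal, since the lines never cross. */
double break_even(linear_model bw, linear_model irq)
{
    double slope_gap = bw.slope - irq.slope;
    if (slope_gap == 0.0)
        return -1.0;
    return (irq.t0 - bw.t0) / slope_gap;
}
```

With the coefficients for 10 internal operations (5820, 5880 and 16680, 5760) and for 100 internal operations (5820, 38280 and 16680, 38160), break_even yields 90.5 in both cases, because the intercept gap (10860ns) and the slope gap (120ns per iteration) are unchanged.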
Table 5.2. Performing 100 internal operations and one memory access per iteration

iterations  $T_{bw}$ (ns)  $T_{int}$ (ns)  $T_{int} - T_{bw}$  $T_{bw}(i) - T_{bw}(i-1)$  $T_{int}(i) - T_{int}(i-1)$
0           5820           16680           10860               –                          –
1           44100          54840           10740               38280                      38160
2           82380          93000           10620               38280                      38160
3           120660         131160          10500               38280                      38160
8           312060         321960          9900                38280                      38160
9           350340         360120          9780                38280                      38160
10          388620         398280          9660                38280                      38160
20          771420         779880          8460                382800                     381600
30          1154220        1161480         7260                382800                     381600
80          3068220        3069480         1260                382800                     381600
90          3451020        3451080         60                  382800                     381600
100         3833820        3832680         −1140               382800                     381600
200         7661820        7648680         −13140              3828000                    3816000
300         11489820       11464680        −25140              3828000                    3816000
Table 5.3. Performing 10 iterations and a single memory access per iteration

operations  $T_{bw}$ (ns)  $T_{int}$ (ns)  $T_{int} - T_{bw}$  $T_{bw}(i) - T_{bw}(i-1)$  $T_{int}(i) - T_{int}(i-1)$
0           28620          38280           9660                –                          –
1           31620          41880           10260               3000                       3600
2           34620          45480           10860               3000                       3600
3           38460          49080           10620               3840                       3600
8           56460          67080           10620               3840                       3600
9           60420          70680           10260               3960                       3600
10          64620          74280           9660                4200                       3600
20          100620         110280          9660                36000                      36000
30          136620         146280          9660                36000                      36000
80          316620         326280          9660                36000                      36000
90          352620         362280          9660                36000                      36000
100         388620         398280          9660                36000                      36000
200         748620         758280          9660                360000                     360000
300         1108620        1118280         9660                360000                     360000
5.3.2 Varying the Coprocessor Memory Access Rate
Table 5.3 presents the execution times for busy-wait and interrupt, varying the number of coprocessor internal operations, performing 10 iterations and one memory access per iteration.
Figure 5.10. Graphical representation for $T_{bw}$ and $T_{int}$, in terms of iterations
Although the differences between consecutive execution times obtained for the interrupt are the same, this is not the case for busy-wait. This is because every time we include a new coprocessor internal operation, the coprocessor memory cycle changes, altering the bus arbitration time for the busy-wait, but not for the interrupt.⁷ Nevertheless, the execution times for the busy-wait start showing a steadier behavior when the number of coprocessor internal operations is greater than 10, corresponding to an average time. We saw in Section 5.3.1 that by changing the number of coprocessor internal operations, we also change the coprocessor memory access rate. From the results above, we confirm that the difference between the execution times for busy-wait and for interrupt can be considered constant when only the number of coprocessor internal operations is changed. As stated in Section 5.3.1, we can again conclude that the difference between the execution times for busy-wait and the execution times for interrupt does not change when changing the memory access rate.

Table 5.4. Performing 10 iterations and 10 internal operations

accesses    $T_{bw}$ (ns)  $T_{int}$ (ns)  $T_{int} - T_{bw}$  $T_{bw}(i) - T_{bw}(i-1)$  $T_{int}(i) - T_{int}(i-1)$
0           50220          61080           10860               –                          –
1           64620          74280           9660                14400                      13200
2           77220          86280           9060                12600                      12000
3           89820          98280           8460                12600                      12000
8           152820         158280          5460                12600                      12000
9           165420         170280          4860                12600                      12000
10          178020         182280          4260                12600                      12000
20          304020         302280          −1740               126000                     120000
30          430020         422280          −7740               126000                     120000
80          1060020        1022280         −37740              126000                     120000
90          1186020        1142280         −43740              126000                     120000
100         1312020        1262280         −49740              126000                     120000
5.3.3 Varying the Number of Coprocessor Memory Accesses
In the previous sections, we changed two of the function's parameters: iterations and operations. Equations (5.6) and (5.10) are based on iterations only, since the number of coprocessor internal operations is of no importance when analyzing the difference between the two interface mechanisms. Let us now change the total number of memory accesses, by changing accesses, as presented in Table 5.4. The number of coprocessor memory accesses for which the execution times of busy-wait and interrupt are the same is obtained from:

\[ T_{bw}(accesses) = 50220 + (12600 \times accesses) \tag{5.12} \]

\[ T_{int}(accesses) = 61080 + (12000 \times accesses) \tag{5.13} \]

\[ T_{int} - T_{bw} = 0 \Rightarrow 10860 - (600 \times accesses) = 0 \tag{5.14} \]

\[ T_{int} = T_{bw} \Rightarrow accesses = 18.10 \tag{5.15} \]
Since these results are for 10 iterations, we conclude that the total number of coprocessor memory accesses is 181.0. Equations (5.12) and (5.13) indicate that each memory access adds 1260ns to the busy-wait execution time and 1200ns to the interrupt execution time. We observe that the difference between the busy-wait execution times when accesses = 0 and when accesses = 1 (1440ns) is bigger than the one obtained for the other values (1260ns). This is because, unlike the other values for accesses, when accesses = 0 there is no bus arbitration. The bus arbitration can vary from 3 to 6 clock cycles. The same happens to the interrupt execution times, except that the bus arbitration is always 3 clock cycles. Figure 5.11 shows how the execution times change with the number of coprocessor memory accesses per iteration. It is similar to the behavior presented in Figure 5.10, when iterations is changed, but not accesses.

Equations (5.6) and (5.10) are the same, in spite of the number of coprocessor internal operations being different. So, we conclude that, unlike iterations and accesses, variable operations does not determine the behavior of the difference between the execution times for the two interface mechanisms. Nevertheless, these two equations are based only on iterations, since variable accesses is kept constant. We conclude that Equation (5.6) is a special case and that a more complete equation must be formulated, not only in terms of iterations, but also in terms of accesses. In order to do so, we will use the results obtained by varying iterations, for different values of accesses, and performing 10 coprocessor internal operations. Table 5.5 presents results when there are no memory accesses, and the following equations can be formulated based on these results:

\[ T_{bw}(iterations) = 5820 + (4440 \times iterations) \tag{5.16} \]

\[ T_{int}(iterations) = 16680 + (4440 \times iterations) \tag{5.17} \]

\[ T_{int} - T_{bw} = 10860 \tag{5.18} \]

Figure 5.11. Graphical representation for $T_b$ and $T_i$, in terms of accesses
Table 5.5. Performing 10 internal operations and no memory accesses
iterations   Tbw (ns)   Tint (ns)   Tint − Tbw   Tbw(i) − Tbw(i−1)   Tint(i) − Tint(i−1)
0            5820       16680       10860        –                   –
1            10320      21120       10800        4500                4440
2            14820      25560       10740        4500                4440
3            19320      30000       10680        4500                4440
8            41520      52200       10680        4500                4440
9            46020      56640       10620        4500                4440
10           50220      61080       10860        4200                4440
20           94620      105480      10860        44400               44400
30           139020     149880      10860        44400               44400
80           361020     371880      10860        44400               44400
90           405420     416280      10860        44400               44400
100          449820     460680      10860        44400               44400
200          893820     904680      10860        444000              444000
300          1337820    1348680     10860        444000              444000
The coefficients for iterations in Equations (5.16) and (5.17) correspond to the time spent by the coprocessor in executing the function, without performing any memory access. In this case, there is no bus arbitration and the execution time is the same for both mechanisms. The difference between these times for busy-wait and interrupt is constant and related to the handshake completion, as given by Equation (5.18).
Table 5.6 presents results for the coprocessor performing 2 memory accesses per iteration. From these results we can formulate the following equations, in terms of iterations:
Tbw (iterations) = 5820 + (7140 × iterations)
(5.19)
Tint (iterations) = 16680 + (6960 × iterations)
(5.20)
Tint − Tbw = 0 ⇒ 10860 − (180 × iterations) = 0
(5.21)
Tint = Tbw ⇒ iterations = 60.33
(5.22)
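The crossover given by Equations (5.21) and (5.22) can be reproduced with a few lines of code. The sketch below simply encodes Equations (5.19) and (5.20); the function names are illustrative. The linear models match the rows of Table 5.6 for iterations = 10 and above, while the first few rows deviate slightly because the busy-wait increment alternates between 7320ns and 6960ns.

```python
def t_bw(iterations):
    """Busy-wait execution time in ns, Equation (5.19)."""
    return 5820 + 7140 * iterations


def t_int(iterations):
    """Interrupt execution time in ns, Equation (5.20)."""
    return 16680 + 6960 * iterations


def crossover_iterations():
    """Number of iterations at which Tint = Tbw, Equations (5.21)-(5.22)."""
    return (16680 - 5820) / (7140 - 6960)


def faster_mechanism(iterations):
    """Name of the mechanism with the shorter execution time."""
    return "busy-wait" if t_bw(iterations) <= t_int(iterations) else "interrupt"


print(crossover_iterations())
print(faster_mechanism(60), faster_mechanism(61))
```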
Equation (5.21) is similar to Equation (5.6) in the sense that it shows the dependency on the handshake completion and the bus arbitration. Nevertheless, the difference is in the bus arbitration, which is directly related to the memory accesses performed. Table 5.7 presents some results by varying iterations and performing 3 memory accesses per iteration, keeping 10 coprocessor internal operations. We can now formulate the equations for Tbw and Tint , in terms of iterations, and obtain the number of iterations for which the two times are equal:
Table 5.6. Performing 10 internal operations and 2 memory accesses per iteration
iterations   Tbw (ns)   Tint (ns)   Tint − Tbw   Tbw(i) − Tbw(i−1)   Tint(i) − Tint(i−1)
0            5820       16680       10860        –                   –
1            13140      23640       10500        7320                6960
2            20100      30600       10500        6960                6960
3            27420      37560       10140        7320                6960
8            62940      72360       9420         6960                6960
9            70260      79320       9060         7320                6960
10           77220      86280       9060         6960                6960
20           148620     155880      7260         71400               69600
30           220020     225480      5460         71400               69600
80           577020     573480      −3540        71400               69600
90           648420     643080      −5340        71400               69600
100          719820     712680      −7140        71400               69600
200          1433820    1408680     −25140       714000              696000
300          2147820    2104680     −43140       714000              696000
Table 5.7. Performing 10 internal operations and 3 memory accesses per iteration
iterations   Tbw (ns)   Tint (ns)   Tint − Tbw   Tbw(i) − Tbw(i−1)   Tint(i) − Tint(i−1)
0            5820       16680       10860        –                   –
1            14220      24840       10620        8400                8160
2            22620      33000       10380        8400                8160
3            31020      41160       10140        8400                8160
8            73020      81960       8940         8400                8160
9            81420      90120       8700         8400                8160
10           89820      98280       8460         8400                8160
20           173820     179880      6060         84000               81600
30           257820     261480      3660         84000               81600
80           677820     669480      −8340        84000               81600
90           761820     751080      −10740       84000               81600
100          845820     832680      −13140       84000               81600
200          1685820    1648680     −37140       840000              816000
300          2525820    2464680     −61140       840000              816000
Tbw (iterations) = 5820 + (8400 × iterations)
(5.23)
Tint (iterations) = 16680 + (8160 × iterations)
(5.24)
Tint − Tbw = 0 ⇒ 10860 − (240 × iterations) = 0
(5.25)
Tint = Tbw ⇒ iterations = 45.25
(5.26)
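The coefficients of Equations (5.23) and (5.24) can be recovered from the simulation results themselves. The following sketch, using a handful of rows copied from Table 5.7, fits a straight line through the first and last row, checks that the remaining rows lie on it, and then derives the crossover of Equation (5.26). The helper names `fit_line` and `on_line` are illustrative, and the two-point fit is an assumption made here for simplicity rather than the authors' procedure.

```python
# (iterations, Tbw in ns, Tint in ns), copied from Table 5.7
TABLE_5_7 = [
    (0, 5820, 16680),
    (1, 14220, 24840),
    (10, 89820, 98280),
    (100, 845820, 832680),
    (300, 2525820, 2464680),
]
ACCESSES_PER_ITERATION = 3


def fit_line(points):
    """Return (intercept, slope) of the line through the first and last point."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    slope = (y1 - y0) / (x1 - x0)
    return y0 - slope * x0, slope


def on_line(points, intercept, slope):
    """True if every point satisfies y = intercept + slope * x."""
    return all(y == intercept + slope * x for x, y in points)


bw_points = [(n, t_bw) for n, t_bw, _ in TABLE_5_7]
int_points = [(n, t_int) for n, _, t_int in TABLE_5_7]

bw_intercept, bw_slope = fit_line(bw_points)
int_intercept, int_slope = fit_line(int_points)

# Tint - Tbw = 0  =>  iterations = (16680 - 5820) / (8400 - 8160)
crossover = (int_intercept - bw_intercept) / (bw_slope - int_slope)
total_accesses = crossover * ACCESSES_PER_ITERATION

print(bw_slope, int_slope, crossover, total_accesses)
```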
The total number of coprocessor memory accesses in this case is 135.75. We can see that Equation (5.25) is similar to Equations (5.6) and (5.21), showing
the dependency of the execution times on the handshake completion and bus arbitration. Since there are 3 memory accesses being performed per iteration, only the bus arbitration parameter changes.
The aim now is to obtain an equation based on iterations and accesses, for each of the mechanisms: Tbw (iterations, accesses) and Tint (iterations, accesses). In order to do this, we will use Equations (5.4), (5.5), (5.19), (5.20), (5.23) and (5.24), including the additional parameter:
Tbw (iterations, 1) = 5820 + (5880 × iterations)
(5.27)
Tbw (iterations, 2) = 5820 + (7140 × iterations)
(5.28)
Tbw (iterations, 3) = 5820 + (8400 × iterations)
(5.29)
Tbw (iterations, 2) − Tbw (iterations, 1) = 1260 × iterations
(5.30)
Tbw (iterations, 3) − Tbw (iterations, 2) = 1260 × iterations
(5.31)
Tint (iterations, 1) = 16680 + (5760 × iterations)
(5.32)
Tint (iterations, 2) = 16680 + (6960 × iterations)
(5.33)
Tint (iterations, 3) = 16680 + (8160 × iterations)
(5.34)
Tint (iterations, 2) − Tint (iterations, 1) = 1200 × iterations
(5.35)
Tint (iterations, 3) − Tint (iterations, 2) = 1200 × iterations
(5.36)
The result provided by Equation (5.30) is directly related to the way the coprocessor performs a memory access. The coprocessor is implemented as a state machine, based on a clock (copro clk), which is half the speed of the microcontroller clock, and changes state on the rising edge of this clock. The coprocessor sends an address on the rising edge of copro clk, which will be released to the main address bus on the rising edge of the main clock (clk), as soon as the bus is granted. In order to indicate a memory request, the coprocessor asserts mem req at the same time. In the next copro clk cycle, it waits until mem fin is asserted. This signal is directly related to Nmemfin, generated by the bus interface controller, once the coprocessor memory operation finishes. Input signals to the coprocessor are sampled on the falling edge of copro clk, in the same manner as the microcontroller does, but using clk. The bus interface controller asserts Nmemfin on the falling edge of clk, a half clock cycle before the end of the coprocessor memory access. If Nmemfin is asserted a half clock cycle before a negative transition of copro clk, mem fin will be asserted 30ns later, which means at the same time the coprocessor memory access finishes, as can be seen from Figure 5.12 (see Note 8). If Nmemfin is asserted a half clock cycle after a negative transition of copro clk, then mem fin will be asserted 90ns later, which means 60ns after the coprocessor memory access finishes. A new coprocessor memory access
Figure 5.12. Relation between Nmemfin and mem fin, for busy-wait
will take place 540ns after mem fin is asserted, which corresponds to the sequence of memory accesses performed inside the loop controlled by accesses (see Figure 5.1). Summarizing, we have the following times for every coprocessor memory access:
coprocessor memory request (mem req) to bus request (Nbr): 120ns;
bus arbitration: 180 to 360ns;
coprocessor memory write: 300ns;
delay in signalling end of coprocessor memory operation (mem fin): 0 to 60ns;
signal mem fin to the next coprocessor memory request: 540ns.
Therefore, the time required for each memory access would vary from 1140ns to 1380ns, giving an average of 1260ns, as provided by Equation (5.30). The
same study must be done for Equation (5.35), which is related to the interrupt mechanism. The only difference is in bus arbitration, which takes 180ns and does not vary as in the busy-wait case. The time required for each memory access would vary from 1140ns to 1200ns, with an average of 1170ns. Our example presents the worst case possible, when using the interrupt mechanism, since the time obtained for each memory access is maximal, as mem fin is asserted 60ns after the end of the coprocessor memory access.
We can deduce that the coefficient for iterations in the equations for Tbw and Tint is directly related to the number of memory accesses per iteration. From Equations (5.4) and (5.5), we know that the coprocessor memory cycle is 5880ns for the busy-wait, and 5760ns for the interrupt. Since 1260ns are introduced for each memory access per iteration with the busy-wait mechanism, and 1200ns are introduced for each memory access per iteration with the interrupt mechanism, we can formulate the following equations for Tbw and Tint , in terms of iterations and accesses:
Tbw (iterations, accesses) = 5820 + ((5880 + (accesses − 1) × 1260) × iterations)
(5.37)
Tbw (iterations, accesses) = 5820 + ((4620 + (1260 × accesses)) × iterations)
(5.38)
Tint (iterations, accesses) = 16680 + ((5760 + (accesses − 1) × 1200) × iterations)
(5.39)
Tint (iterations, accesses) = 16680 + ((4560 + (1200 × accesses)) × iterations)
(5.40)
Since our aim is to obtain the total number of memory accesses, for which the execution time for busy-wait is the same as for interrupt, we are interested in the difference between Tbw and Tint :
Tint − Tbw = 10860, for accesses = 0
Tint − Tbw = 10860 − (60 + (60 × accesses)) × iterations, for accesses > 0
(5.41)
Equation (5.41) does not include operations, since the difference between Tbw and Tint does not change with this parameter, as determined already in Equations (5.6) and (5.10). If we consider the average time required for each memory access in the interrupt, i.e. 1170ns, we will have the following equations for Tbw and Tint :
Tbw (iterations, accesses) = 5820 + ((4620 + (1260 × accesses)) × iterations)
(5.42)
Tint (iterations, accesses) = 16680 + ((4560 + (1170 × accesses)) × iterations)
(5.43)
Equation (5.41) can now be rewritten in terms of the average memory access time for the interrupt mechanism:
Tint − Tbw = 10860, for accesses = 0
Tint − Tbw = 10860 − (60 + (90 × accesses)) × iterations, for accesses > 0
(5.44)
Equation (5.44) shows the dependency on the difference between the bus arbitration for the busy-wait (see Note 9) and interrupt mechanisms (90ns) and their handshake completion (10860ns).
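The model developed in this section can be summarised in a short script. The sketch below encodes the per-access time budget listed earlier together with Equations (5.38), (5.40), (5.42) and (5.43), and derives the break-even number of iterations implied by Equations (5.41) and (5.44). All function and constant names are illustrative. Following Equation (5.41), the model is only applied for accesses ≥ 1, since with no memory accesses the difference stays constant at 10860ns.

```python
HANDSHAKE_DIFF_NS = 16680 - 5820  # 10860 ns, Equation (5.18)


def access_time_range(arbitration_ns):
    """(min, max) time in ns of one coprocessor memory access.

    arbitration_ns is the (min, max) bus arbitration time:
    (180, 360) for busy-wait, (180, 180) for interrupt.
    """
    request_to_bus_request = 120
    memory_write = 300
    mem_fin_delay = (0, 60)
    mem_fin_to_next_request = 540
    fixed = request_to_bus_request + memory_write + mem_fin_to_next_request
    return (fixed + arbitration_ns[0] + mem_fin_delay[0],
            fixed + arbitration_ns[1] + mem_fin_delay[1])


def t_bw(iterations, accesses):
    """Busy-wait execution time in ns, Equations (5.38) and (5.42)."""
    return 5820 + (4620 + 1260 * accesses) * iterations


def t_int(iterations, accesses, per_access_ns=1200):
    """Interrupt execution time in ns.

    per_access_ns=1200 gives Equation (5.40) (worst case),
    per_access_ns=1170 gives Equation (5.43) (average).
    """
    return 16680 + (4560 + per_access_ns * accesses) * iterations


def break_even_iterations(accesses, per_access_ns=1200):
    """Iterations at which Tint = Tbw, from Equations (5.41) and (5.44)."""
    if accesses < 1:
        raise ValueError("with no memory accesses the difference is constant")
    gain_per_iteration = (4620 + 1260 * accesses) - (4560 + per_access_ns * accesses)
    return HANDSHAKE_DIFF_NS / gain_per_iteration


print(access_time_range((180, 360)), access_time_range((180, 180)))
print(break_even_iterations(2), break_even_iterations(3))
print(break_even_iterations(3, per_access_ns=1170))
```

With the worst-case interrupt figure, the script reproduces the crossovers of Equations (5.22) and (5.26); with the average figure of 1170ns, the crossover moves to a smaller number of iterations.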
5.4
Summary
In this chapter, we have investigated the performance of the co-design system, based on the simulation of the single-port shared memory configuration. For this purpose, we designed a benchmark program that allows us to analyze the performance for different numbers and rates of coprocessor memory accesses. Timing characteristics were provided for the two interface mechanisms used: busy-wait and interrupt. The relation between the coprocessor memory accesses and the interface mechanism was modelled. Since the bus arbitration mechanism has been shown to be directly related to the coprocessor memory accesses, we believe that some performance improvement can be obtained with the utilization of other memory architectures. The next chapter will investigate the proposal that a dual-port shared memory could produce better performance improvements than a single-port shared memory, due to the removal of the bus arbitration mechanism.
Notes
1 The difference of 1 clock cycle between the time required to pass inst and the time required to pass the function’s parameters is due to the synchronization of the microcontroller and the coprocessor. As it was explained before, the first runs at 16MHz and the second at 8MHz.
2 We are working with only one coprocessor, implementing only one function.
3 The coprocessor memory write corresponds to the time during which Rnw is 0.
4 A bus cycle requires a minimum of 3 clock cycles, divided into 6 states, each one of half clock cycle. Specific tasks are assigned to each state, depending if it is a read or write operation. The microcontroller inserts wait states while the device does not answer the request during an asynchronous bus operation.
5 The number of iterations is equal to 0 and, then, the body of the function is not entered, which means that there are neither coprocessor internal operations nor memory accesses to be executed.
6 The total number of coprocessor memory accesses is given by the product iterations × accesses and the total number of coprocessor internal operations by the product iterations × operations.
7 As we saw in Section 5.2, the bus arbitration for busy-wait can vary from 3 to 6 clock cycles, but for interrupt it is always 3 clock cycles.
8 The coprocessor memory access corresponds to the time Nbgack is kept asserted.
9 The bus arbitration with busy-wait can last 180ns to 360ns, but in interrupt it lasts 180ns. We are considering here the average time in busy-wait (270ns), thus providing a difference of 90ns between the two mechanisms.
Chapter 6 DUAL-PORT MEMORY CONFIGURATION
The analysis carried out in the previous chapter led us to the conclusion that the most suitable interface mechanism, i.e. the one that provides the best performance, depends on the form of the coprocessor memory accesses. We determined the relationship between these memory accesses and the interface mechanisms for a single-port shared memory. Due to the inherent bus contention problems, we concluded that a considerable amount of time is spent in the bus arbitration mechanism. Our aim now is to study the possibility of obtaining a further performance improvement by applying a different memory to the co-design system. A dual-port memory has already been used in previous co-design systems as a parameter memory (Edwards and Forrest, 1994; Edwards and Forrest, 1995; Edwards and Forrest, 1996b). In this case, our proposal is to replace the single-port shared memory of the original implementation by a dual-port memory and thus, hopefully, avoid bus contention problems. However, instead of bus contention, we might have memory contention, since the processor and coprocessor may now want to access the same memory location at the same time. Since the coprocessor uses the main memory only for array accesses (all other data types and pointers to arrays are kept inside the coprocessor in memory-mapped registers), memory contention is a very remote possibility, unless the array is being shared, which is not the case in the present implementation. In order to compare the performance of the new memory system with the previous one, we assume a dual-port memory with the same experimental
Figure 6.1. Logic symbol of the dual-port memory
characteristics as the single-port memory, i.e. same size and access time. Also, we use the same procedure as the one adopted in Chapter 5 to obtain a new set of simulation results.
6.1
General Description
The dual-port memory considered here is based on the commercially available Am2130 (AMD, 1990) and has two independent ports called the left and right port. Each port consists of an 8-bit bi-directional data bus (data7 : 0) and a 10-bit address input bus (addr9 : 0), which is time multiplexed into row address and column address, plus the necessary control signals: row address strobe (Nras), column address strobe (Ncas) and read/write (rNw). The dual-port memory contains an on-chip arbiter to resolve contention between the left and right ports. When contention between ports occurs, one port is given priority while the other port receives a busy indication (Nbusy). Another type of dual-port memory does not provide on-chip hardware arbitration and therefore requires an external arbiter to handle simultaneous accesses to the same dual-port RAM location. Figure 6.1 shows the logic symbol of the dual-port memory under consideration.
6.1.1
Contention Arbitration
Two independent access facilities are usually provided in a dual-port memory to eliminate physical interference between data signals. However, there are two significant possibilities for logical interference which are not tolerable. In the first case, one port is reading from a location while the other port is writing into the same location at the same time. In this situation, data received by the
reading port may not be predictable. Similarly, consider the situation when both ports write information into the same location simultaneously. The resultant data that finally ends up in the memory location may not be valid. These two situations are commonly called contention. A dual-port memory usually has on-chip logic to detect contention and give priority to one port over the other. In a true dual-port memory, simultaneous reading from both ports at the same address does not corrupt the data. Hence, it can be construed that no contention occurs. However, for the sake of compatibility with the industry standard practices, the arbitration is based purely on addresses. Then, in the case of a simultaneous read from both ports at the same address, the arbitration logic will sense contention and give priority to one of the ports. The other port will receive a busy indication. Figure 6.2 is a conceptual logic diagram of contention arbitration, based on the commercially available dual-port memory described in Section 6.1. The left side comparator receives the left port row and column address and the delayed version of the right port row and column address. Similarly, the right side comparator receives the right port row and column address and the delayed version of the left port row and column address. The output of the comparator is connected to a latch composed of two cross-coupled NAND gates. The left row address strobe (Nras l) and the left column address strobe (Ncas l) are connected to a NOR gate to identify a valid left port address. Likewise, the right row address strobe (Nras r) and the right column address strobe (Ncas r) are connected to a NOR gate to identify a valid right port address. The outputs of the NOR gates are connected to the same latches used by the comparators. The Nbusy l and Nbusy r outputs are generated by gating the latch output with the appropriate row/column address strobe combination. 
Also, note that the latch outputs are used internally for left and right enable signals (en). The operation of the arbitration circuit can now be explained. Assume that the left port address is stable and Nras l and Ncas l are low. Both Q and Q outputs of the latch will be high, because the output of both comparators is low (the addresses are different). So, the Nbusy output of both sides is high. Now, assume that the right address becomes equal to the left address. The right address comparator output goes high and the Q output of the latch goes low. Eventually, the output of the left comparator also goes high, but the Q output remains high, because of the cross-coupling of the Q . When the Nras l and Ncas l inputs go low, Nbusy r becomes low. Thus, the arbitrator gave priority to the left port. Sooner or later, the left port will finish its transaction at the selected memory location and change the address or its Nras l/Ncas l will go
Figure 6.2. Arbitration logic
high. Thus, when the contention is over, the Q output of the latch will become high and Nbusy r will go high. A similar reasoning can be used to understand the operation of the left side. In the cases of contention, the arbiter will decide one port as the winner and the losing port must wait for the winner to complete the use of the memory. The winning port must indicate to the arbiter that it has completed its operation either by changing the address or making its address strobe inputs high. Without such indication, the arbiter will not remove the busy indication to the losing port.
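The arbitration behaviour described above can be illustrated with a small behavioural model. This is a sketch under simplifying assumptions, not a description of the Am2130 circuit: the class and method names are hypothetical, the comparators, latch and strobes are abstracted into an "address presented" and a "released" event per port, and exactly simultaneous arrival of both ports is not modelled.

```python
class DualPortArbiter:
    """Address-based contention arbitration between a left and a right port."""

    def __init__(self):
        self.address = {"left": None, "right": None}  # None: strobes negated
        self.busy = {"left": False, "right": False}   # True: Nbusy asserted

    @staticmethod
    def _other(port):
        return "right" if port == "left" else "left"

    def present(self, port, address):
        """Port presents a valid address; returns True if access is granted."""
        other = self._other(port)
        self.address[port] = address
        # If this port was the winner and has moved to another address,
        # the contention is over and the losing port may proceed.
        if self.busy[other] and self.address[other] != address:
            self.busy[other] = False
        # Arbitration is based purely on addresses, so two reads of the
        # same location also count as contention.
        contention = self.address[other] == address
        # The port that was there first keeps priority.
        self.busy[port] = contention and not self.busy[other]
        return not self.busy[port]

    def release(self, port):
        """Port negates its address strobes, ending any contention."""
        self.address[port] = None
        self.busy[port] = False
        self.busy[self._other(port)] = False


arbiter = DualPortArbiter()
left_granted = arbiter.present("left", 0x10)    # no contention
right_granted = arbiter.present("right", 0x10)  # same address: right loses
arbiter.release("left")                         # winner finishes
right_still_busy = arbiter.busy["right"]
```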
6.1.2
Read/Write Operations
Performing read/write operations when there is no contention is relatively straightforward. The signal Nras negative transition is followed by Ncas negative transition for all read and write cycles. The delay between the two transitions, referred to as multiplex window, gives flexibility to set up the external addresses. Signal rNw must be stable before signal Ncas negative transition. When a read or write operation is initiated by a port and contention from the other port occurs, the implications are very simple. The losing port sees its Nbusy line go low and it must wait until it goes high. Thus, in this case of contention, the operation does not really start when the port initiates it. Instead, the operation starts when the Nbusy line goes high.
Figure 6.3. Main system architecture
6.2
The System Architecture
Considering the main system presented in Chapter 4, only the memory component (DRAM16), the coprocessor board component (COPRO BOARD) and the main memory controller component (FPGA MAIN CTRL) were modified. For simplicity, we maintain the same terminology, except when a new component is introduced. Figure 6.3 shows the new architecture of the main system. The shared memory component was replaced by the dual-port memory, keeping some of the original basic characteristics, such as memory access time and time multiplexed row/column addresses. The coprocessor board, shown in Figure 6.4, was modified to include a memory controller component, a simplified bus interface controller component (since now the coprocessor does not need to use the system bus for memory accesses) and new connections to the coprocessor dual-port data buffers, which are located inside the coprocessor component. The main controller, which controls the memory accesses required via the main system bus, was modified to identify the busy flag sent by the dual-port memory.
6.2.1
Dual-Port Memory Model
The memory DRAM16 consists of two 1M×8 bits DRAM, as described in Chapter 4. The memory model was modified to behave as the dual-port memory shown in Figure 6.5. Signals assigned to the left port are suffixed with “ l”, while those assigned to the right port are suffixed with “ r”. To enable byte access, each DRAM must be selected individually: the byte from an even address is related to the most significant byte of the system data bus (data15 : 8) and that from an odd address is related to the least significant byte of the system data bus (data7 : 0). Signals Ncas0 and Ncas1 correspond to even and odd
Figure 6.4. Coprocessor board for the dual-port configuration
Figure 6.5. Logic symbol of DRAM16
column address strobes, respectively. Similarly, signals Nbusy0 and Nbusy1 correspond to even and odd busy flags, respectively. Each DRAM has a memory array defined to be used by both ports. There is an asynchronous state machine for the left side and another one for the right side, each of which is an instance of that described in Figure 6.6 (see Note 1). In these state machines, the signals are named accordingly, except the shared memory array. The state machine remains in the initial state idle, keeping the data bus disabled, until one of the address strobe signals is asserted. If Nras is asserted before Ncas, the row address is read from the system address bus. If Ncas is
asserted before Nras, the column address is read from the system address bus and a refresh cycle starts. Once in state ras before cas, the state machine waits for Ncas to be asserted in order to read the column address from the system address bus. If the memory array is accessible (en = 1) and Ncas is asserted the operation proceeds. Meanwhile, if Nras is negated, the state machine returns to idle. The state machine remains in state ras and cas waiting for the operation to complete. Once the state is reached, the next step is to identify the type of the operation: if it is a read operation (rNw = 1), the data bus receives the content of the memory array addressed by row and col; if it is a write operation (rNw = 0), the content of the memory array addressed by row and col receives the data bus. The process is repeated until the state machine changes state. As soon as Ncas is negated, the state machine returns to idle, if the operation has already finished (Nras = 1); otherwise, it returns to ras before cas, since Nras is still asserted. State cas before ras is entered only when the refresh operation is being carried out. Once Nras is asserted, the state machine enters a refresh state, that is left only when Nras and Ncas are negated, returning the state machine to its initial state. The refresh operation is not implemented, following the simplified approach proposed in Chapter 4 for simulation.
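The state machine of the dual-port memory model can be sketched in software as follows. This is a behavioural approximation written for illustration: the class name `DramPort`, the `step` interface and the dictionary used as the shared memory array are assumptions, the strobes are treated as active-low levels (0 means asserted), and, as in the simulation model, the refresh operation itself is not implemented.

```python
class DramPort:
    """One port (left or right) of the dual-port DRAM model."""

    def __init__(self, memory):
        self.memory = memory  # memory array shared by both ports
        self.state = "idle"
        self.row = None
        self.col = None

    def step(self, nras, ncas, addr=0, rnw=1, en=1, data=None):
        """Evaluate the inputs once; returns read data or None."""
        if self.state == "idle":
            if nras == 0 and ncas == 1:        # Nras asserted before Ncas
                self.row = addr
                self.state = "ras_before_cas"
            elif ncas == 0 and nras == 1:      # Ncas first: refresh cycle
                self.col = addr
                self.state = "cas_before_ras"
        elif self.state == "ras_before_cas":
            if nras == 1:                      # Nras negated: abandon
                self.state = "idle"
            elif ncas == 0 and en == 1:        # array accessible
                self.col = addr
                self.state = "ras_and_cas"
        elif self.state == "ras_and_cas":
            if ncas == 1:                      # Ncas negated: access ends
                self.state = "idle" if nras == 1 else "ras_before_cas"
            elif rnw == 1:                     # read operation
                return self.memory.get((self.row, self.col))
            else:                              # write operation
                self.memory[(self.row, self.col)] = data
        elif self.state == "cas_before_ras":
            if nras == 0:
                self.state = "refresh"
        elif self.state == "refresh":
            if nras == 1 and ncas == 1:
                self.state = "idle"
        return None


array = {}
left, right = DramPort(array), DramPort(array)

# Left port writes 0xAB at row 5, column 7.
left.step(nras=0, ncas=1, addr=5)
left.step(nras=0, ncas=0, addr=7)
left.step(nras=0, ncas=0, rnw=0, data=0xAB)
left.step(nras=1, ncas=1)

# Right port reads the same location back.
right.step(nras=0, ncas=1, addr=5)
right.step(nras=0, ncas=0, addr=7)
value = right.step(nras=0, ncas=0, rnw=1)
```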
6.2.2
The Coprocessor
This component includes the data buffers, the clock divider, the accelerator and some glue logic. Due to the connection to the dual-port memory, extra signals were included, based on the original implementation, such as dual-port address and data buses. Figure 6.7 presents the coprocessor logic symbol. The input data can be parameters passed by the microcontroller or data from the dual-port memory, during a coprocessor read operation. Output data can be returned parameters to the microcontroller or data to the dual-port memory, during a coprocessor write operation. In this manner, extra control had to be added to the original implementation, described in Chapter 4, due to the connection of the coprocessor data buffers to two different buses: main system data bus and dual-port data bus.
6.2.3
Bus Interface Controller
This component, shown in Figure 6.8, is no longer concerned with the coprocessor memory accesses, but only with the microcontroller accesses to the control register and for parameter passing. This implies the suppression of the coprocessor memory controller, thus reducing the logic.
Figure 6.6. State machine for the dual-port memory model
6.2.4
Coprocessor Memory Controller
This is a new component responsible for the coprocessor memory accesses to its dual-port memory side, as described in Figure 6.9. In the original implementation, this task was performed by the bus interface controller and the main memory controller. There are basically two synchronous state machines. Figure 6.10 shows the first one, designed to control the coprocessor memory accesses. This state machine identifies a coprocessor memory request, starts the necessary control signals and controls the completion of the operation. The controller starts in the initial state idle, waiting for a coprocessor memory request. While Nmemreq is asserted, a valid address is sent by the coprocessor and the memory access starts.
Figure 6.7. Logic symbol of the coprocessor for the dual-port configuration
Figure 6.8. Logic symbol of the bus interface controller
In state c1 ds, the controller waits for the memory access to complete, which is indicated by the signal Ncas01, asserted by the DRAM controller. The address strobe signal (Nas) and the data strobe signal (Nds) are asserted, indicating the beginning of a memory access. The signal dataen1 : 0 indicates an operation with the low order word of the 32-bit internal buffer. State c2 addr is entered during a long word access only and introduces a delay before starting a new memory access. During this state, the address of the next word is provided, by asserting signal addr inc. Since a new memory
Figure 6.9. Logic symbol of the coprocessor memory controller
access is required, signals Nas and Nds are negated. In the same manner, signal dataen1 : 0 disables the internal buffers. Once in state c2 ds, the address to the second word of a long word access is valid and the signals Nas and Nds are asserted. Signal dataen1 : 0 indicates the access to the high order word of the 32-bit internal buffer. The controller stays in this state until the memory access finishes. State complete indicates the end of the memory access required by the coprocessor, regardless of whether it was a byte/word or long word access. Signal Nmemfin is then asserted, which allows the coprocessor to identify the end of the memory access. Signals Nas, Nds and Nmemfin change state on the falling edge of the clock, while the controller changes state on the rising edge of the clock. This delay is needed to guarantee that a signal is stable when it is required by another state machine that also changes state on the rising edge of the clock. Besides this, Nmemfin is negated immediately after Nmemreq is asserted.
Figure 6.11 presents the second state machine included in the coprocessor memory controller. The DRAM state machine was designed to control the memory access itself. It identifies a valid address and asserts the necessary controls, such as Nras and Ncas. The DRAM controller remains in state idle until a valid address is identified by the assertion of signal Nisdram. Since the row and column addresses are time multiplexed, the row address is sent first, which is indicated by keeping the signal cuNl high. As soon as state ras is entered, the row address strobe is asserted (Nras) and the column address is required, driving cuNl low. As a matter of fact, this signal changes on the falling edge of the clock, providing some delay before the multiplexed address bus changes to the column address. Immediately after this transition, the row and column addresses are ready and the busy flag is valid.
Figure 6.10. State machine of the coprocessor memory accesses controller
If the access to the memory is allowed, the controller enters state rascas, indicating the beginning of the access itself, by asserting signal Ncas01. When the address strobe Nas is negated, the controller identifies the end of the memory access and returns to its initial state.
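The coprocessor memory accesses controller described earlier in this section (Figure 6.10) can be sketched as a transition function plus an output function. The state names follow the text, but several details are assumptions made for illustration: the input `long_word` is a hypothetical flag standing for a long word access, the return from complete to idle is assumed, signals are modelled as active-low levels (0 means asserted), and the half-cycle offset between state changes and output changes is not modelled.

```python
def next_state(state, nmemreq, ncas01, long_word):
    """State of the access controller after the next rising clock edge."""
    if state == "idle":
        return "c1_ds" if nmemreq == 0 else "idle"
    if state == "c1_ds":
        if ncas01 != 0:               # first access still in progress
            return "c1_ds"
        return "c2_addr" if long_word else "complete"
    if state == "c2_addr":            # one-state delay, address incremented
        return "c2_ds"
    if state == "c2_ds":
        return "complete" if ncas01 == 0 else "c2_ds"
    if state == "complete":
        return "idle"                 # assumed: back to idle
    raise ValueError("unknown state: " + state)


def outputs(state):
    """Control signals driven in a given state."""
    strobes_active = state in ("c1_ds", "c2_ds")
    return {
        "Nas": 0 if strobes_active else 1,
        "Nds": 0 if strobes_active else 1,
        "addr_inc": 1 if state == "c2_addr" else 0,
        "Nmemfin": 0 if state == "complete" else 1,
    }


def run(inputs, state="idle"):
    """Apply a sequence of (nmemreq, ncas01, long_word) tuples."""
    trace = [state]
    for nmemreq, ncas01, long_word in inputs:
        state = next_state(state, nmemreq, ncas01, long_word)
        trace.append(state)
    return trace


word_trace = run([(0, 1, False), (0, 1, False), (0, 0, False), (1, 1, False)])
long_trace = run([(0, 1, True), (0, 0, True), (0, 1, True),
                  (0, 0, True), (0, 1, True)])
```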
6.2.5
The Main Controller
This component is responsible for the memory accesses requested from the main system bus. It includes two new inputs Nbusy0 and Nbusy1, corresponding
Figure 6.11. State machine of the DRAM controller
to the busy flags provided by the dual-port memory, as shown in Figure 6.12. Based on this, the DRAM controller, which is part of the main controller, was modified too, to identify the presence of the busy flags, in a similar manner as the one described in Figure 6.11.
6.3
Timing Characteristics
The new model, described in the previous section, was validated for read and write operations. They were performed successfully by reading the expected data and writing in the appropriate locations of the memory. In the same manner as it was done for the single-port shared memory, we must bear in mind the basic timings below. They are based on the main system clock and given in terms of clock cycles:
microcontroller control register write (8 bits): 240ns (4 clock cycles)
microcontroller control register read (8 bits): 180ns (3 clock cycles)
Figure 6.12. Logic symbol of the main controller
microcontroller memory read (16 bits): 240ns (4 clock cycles)
microcontroller memory write (16 bits): 240ns (4 clock cycles)
coprocessor memory read (16 bits): 300ns (5 clock cycles)
coprocessor memory write (16 bits): 300ns (5 clock cycles)
Notice that these times are the same as those for the single-port memory (see Section 5.2), since we are keeping the same memory characteristics in terms of access time and time multiplexed row and column addresses. Parameter passing is done through memory-mapped registers, as before, and the related times are the same as those provided in Section 5.2.1:
microcontroller sends inst (8 bits): 540ns (9 clock cycles)
microcontroller sends first parameter table (32 bits): 1200ns (20 clock cycles)
microcontroller sends next parameters (16 bits each): 600ns (10 clock cycles each)
Bus arbitration is not required when the coprocessor makes a memory access, since it has its own memory port. Nevertheless, the handshake completion is still performed and it depends on the interface mechanism.
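The timings listed above are all multiples of one clock period, which makes them easy to tabulate. The sketch below assumes a 60ns main clock period, a value inferred here from the listed figures (for example 240ns for 4 clock cycles). The helper `parameter_passing_ns`, which adds the time to send inst, the first parameter and any further 16-bit parameters, is an illustrative assumption about how these figures combine.

```python
CLOCK_PERIOD_NS = 60  # inferred from the timings above, e.g. 240 ns / 4 cycles

# Operation -> duration in main clock cycles, as listed above
CYCLES = {
    "control register write": 4,
    "control register read": 3,
    "microcontroller memory read": 4,
    "microcontroller memory write": 4,
    "coprocessor memory read": 5,
    "coprocessor memory write": 5,
    "send inst": 9,
    "send first parameter": 20,
    "send next parameter": 10,
}


def cycles_to_ns(cycles):
    """Convert main clock cycles to nanoseconds."""
    return cycles * CLOCK_PERIOD_NS


def parameter_passing_ns(next_parameters):
    """Time to pass inst, the first parameter and further 16-bit parameters."""
    return (cycles_to_ns(CYCLES["send inst"])
            + cycles_to_ns(CYCLES["send first parameter"])
            + next_parameters * cycles_to_ns(CYCLES["send next parameter"]))


for name, cycles in CYCLES.items():
    print(name, cycles_to_ns(cycles), "ns")
```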
6.3.1
Interface Mechanisms
Since each processor has its own memory port, there is neither bus contention nor bus arbitration, but there is still the handshake completion. There might be memory contention, but this problem is solved by the memory internal arbitration logic, discussed in Section 6.1.1. When using the busy-wait mechanism, the microcontroller busy-wait cycle takes 5 clock cycles: 3 clock cycles to read the control register and 2 clock cycles to prepare for the next cycle. The handshake completion is directly related to the microcontroller busy-wait cycle, varying from 5 to 9 clock cycles. The description of the handshake completion for the single-port memory, presented in Section 5.2.3, also applies here, since the bus interface controller is still responsible for this task and there are no changes in its protocol:
busy-wait cycle: 300ns (5 clock cycles)
handshake completion: 300ns to 540ns (5 to 9 clock cycles)
For the interrupt mechanism, the procedure is the same as described in Section 5.2.4:
interrupt acknowledge cycle: 10740ns (179 clock cycles)
handshake completion: 11160ns (186 clock cycles)
The difference now between the two interface mechanisms is basically associated with the completion of the handshake, as in (6.1):
handshake completion_int − handshake completion_bw = 10620ns to 10860ns
(6.1)
We can expect a better performance of the dual-port memory, compared to that of the single-port memory, as there is no bus contention. The time corresponding to the bus arbitration, which was added to every coprocessor memory access, is now eliminated. With the dual-port memory, only the completion of the handshake is taken into account.
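The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes a 60ns clock period, which is inferred here from the quoted access times (240ns for 4 clock cycles) rather than stated in this section; the names CLOCK_NS and cycles_to_ns are illustrative only.

```python
# Assumption: one system clock cycle lasts 60 ns (240 ns / 4 cycles, as quoted above).
CLOCK_NS = 60


def cycles_to_ns(cycles):
    """Convert a number of system clock cycles into nanoseconds."""
    return cycles * CLOCK_NS


# Busy-wait handshake completion: 5 to 9 clock cycles.
busy_wait_completion = (cycles_to_ns(5), cycles_to_ns(9))
# Interrupt handshake completion: 186 clock cycles.
interrupt_completion = cycles_to_ns(186)

# Range of the difference between the two mechanisms (smallest value first).
difference = tuple(interrupt_completion - t for t in reversed(busy_wait_completion))
print(busy_wait_completion, interrupt_completion, difference)
```

Under this assumption the difference spans 10620ns to 10860ns, which is the range given for the two handshake completions.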
Performance Results
Table 6.1. Performing 10 internal operations and a single memory access per iteration

iterations  Tbw (ns)   Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)
0           5820       16680      10860       –                  –
1           11220      22080      10860       5400               5400
2           16620      27480      10860       5400               5400
3           22020      32880      10860       5400               5400
8           49020      59880      10860       5400               5400
9           54420      65280      10860       5400               5400
10          59820      70680      10860       5400               5400
20          113820     124680     10860       54000              54000
30          167820     178680     10860       54000              54000
80          437820     448680     10860       54000              54000
90          491820     502680     10860       54000              54000
100         545820     556680     10860       54000              54000
200         1085820    1096680    10860       540000             540000
300         1625820    1636680    10860       540000             540000
6.4
Performance Results
Without bus arbitration, the performance of the system depends only on the handshake completion, which is independent of the coprocessor memory accesses. As a matter of fact, the handshake completion depends only on the interface mechanism. Thus, we can conclude that the performance of the system is directly related to the interface mechanism. We analyze the execution times of the example proposed in Figure 5.1 (see Section 5.1) for different values of its parameters and applying the two interface mechanisms. This will provide us with a new set of simulation results to be compared to the ones obtained for the original implementation, in order to identify any performance improvement.
6.4.1
Varying Internal Operations and Memory Accesses
When varying iterations, we are altering the number of coprocessor internal operations and the number of coprocessor memory accesses, at the same time. Table 6.1 shows the execution times for busy-wait (Tbw ) and interrupt (Tint ) as a function of iterations, performing 10 coprocessor internal operations (operations = 10) and 1 memory access per iteration (accesses = 1). The initial time for Tbw and Tint corresponds to the time necessary to run the coprocessor, without executing the function’s main loop, which means the time spent in parameter passing, some coprocessor internal operations (to deal with the calculations of the loop controlled by iterations) and completion of the handshake. These times are the same as the ones obtained from the original
implementation, since there are no coprocessor memory accesses and parameter passing follows the same procedure. The difference between Tint and Tbw corresponds to the difference between their handshake completions, as in (6.2):

$$T_{int}(0) - T_{bw}(0) = 10860\,\text{ns}\ (181\ \text{clock cycles}) \tag{6.2}$$
There is one memory access every 5400ns for the busy-wait mechanism, as well as for the interrupt. This time includes the execution of 10 coprocessor internal operations, the coprocessor internal control calculations and the coprocessor memory write, as in (6.3) for the busy-wait and (6.4) for the interrupt mechanism:

$$T_{bw}(\text{iterations}) = 5820 + (5400 \times \text{iterations}) \tag{6.3}$$

$$T_{int}(\text{iterations}) = 16680 + (5400 \times \text{iterations}) \tag{6.4}$$
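The two linear models can be evaluated directly to reproduce rows of Table 6.1. The following is a minimal sketch; the function and constant names are illustrative and not taken from the simulation model.

```python
# Constant terms of (6.3) and (6.4), in ns.
BASE_NS = {"busy_wait": 5820, "interrupt": 16680}
# Time per iteration (10 internal operations, 1 memory access), in ns.
PER_ITERATION_NS = 5400


def execution_time(mechanism, iterations, per_iteration_ns=PER_ITERATION_NS):
    """Execution time in ns predicted by the linear models (6.3) and (6.4)."""
    return BASE_NS[mechanism] + per_iteration_ns * iterations


for n in (0, 1, 10, 100, 300):
    t_bw = execution_time("busy_wait", n)
    t_int = execution_time("interrupt", n)
    # The last column is the constant handshake-completion difference.
    print(n, t_bw, t_int, t_int - t_bw)
```

Passing a different per_iteration_ns value lets the same function stand in for the other operating points discussed in this chapter.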
The difference between the execution times obtained for busy-wait and interrupt is constant and equal to the difference between their handshake completions (10860ns), which shows that the system's performance depends on the interface mechanism only. When we compare (6.3) and (6.4) to (5.4) and (5.5), for the original implementation, we notice that they differ only in the coefficient for iterations. Considering first the busy-wait mechanism, we reach the result in (6.7):

$$T_{bw}(\text{iterations})_{\text{single}} = 5820 + (5880 \times \text{iterations}) \tag{6.5}$$

$$T_{bw}(\text{iterations})_{\text{dual}} = 5820 + (5400 \times \text{iterations}) \tag{6.6}$$

$$T_{bw}(\text{iterations})_{\text{single}} - T_{bw}(\text{iterations})_{\text{dual}} = 480 \times \text{iterations} \tag{6.7}$$
We can conclude from the results obtained that the busy-wait mechanism offers shorter execution times than the interrupt mechanism. Figure 6.13 shows the chart of the execution times for both interface mechanisms, as a function of iterations. The result of (6.7) reflects the different way in which each implementation deals with the coprocessor memory accesses. Figure 6.14 shows a coprocessor memory access taking place from the moment Nmemreq is asserted, on the rising edge of the clock, until the next rising edge of the clock just after Nas is negated and Nmemfin is asserted. In the original implementation, before a memory access starts, a bus request takes place, with the assertion of Nbr, and the arbitration process begins. Comparing Figure 6.14 to Figure 5.5 and Figure 5.6, in Section 5.2, we see that two events no longer take place: first, the delay between Nmemreq and Nbr (120ns); second, the bus arbitration (360ns). These two times add up to the difference of 480ns per iteration obtained in (6.7).
Figure 6.13. Chart for Tb and Ti, in terms of iterations
Notice that Nmemreq, Nmemfin and Nas are now driven by the coprocessor memory controller², and no longer by the bus interface. The memory controller runs at the same speed as the microcontroller, using the system clock (clk). On the other hand, signal rNw, driven by the controller, follows the signal Nfpga rw, driven by the accelerator, whose clock (copro clk) is half the frequency of the system clock. This explains why rNw is still low after the memory write has finished. For the interrupt mechanism, we have the result in (6.10):

$$T_{int}(\text{iterations})_{\text{single}} = 16680 + (5760 \times \text{iterations}) \tag{6.8}$$

$$T_{int}(\text{iterations})_{\text{dual}} = 16680 + (5400 \times \text{iterations}) \tag{6.9}$$

$$T_{int}(\text{iterations})_{\text{single}} - T_{int}(\text{iterations})_{\text{dual}} = 360 \times \text{iterations} \tag{6.10}$$

The same reasoning can be used for the interrupt mechanism to explain the result of (6.10), since the coprocessor memory accesses are independent of the interface mechanism. Nevertheless, we should obtain a result equal to 300ns, considering that the bus arbitration takes 180ns in this case, in addition to the 120ns delay between Nmemreq and Nbr. The remaining difference of 60ns can be explained by observing Figure 6.15, which shows a coprocessor memory access. We see that the memory access finishes 270ns after it is requested and there is no delay involved. The accelerator identifies the end of the memory access when
Figure 6.14. Coprocessor memory access
Mem fin is asserted, and then proceeds with its operation. This signal corresponds to Nmemfin, which is sent by the memory controller. Since Nmemfin is an input to the accelerator, it is sampled on the falling edge of its clock (copro clk). If Nmemfin is asserted just before the falling edge of copro clk, a delay of 30ns is introduced, as in our case. But if Nmemfin is asserted just after the falling edge of copro clk, the delay is 90ns, which is the case for the original implementation. This can be seen from Figure 6.15, if we introduce 180ns for Nbr, just after Nmemreq, plus 270ns for the assertion of Nmemfin to happen. We observe that Nmemfin is then asserted just after the falling edge of copro clk. This explains the difference of 60ns between the result of (6.10) and the one expected when considering the bus arbitration only.
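The synchronization effect described above can be illustrated with a small timing sketch. It assumes a copro clk period of 120ns (half the frequency of a 60ns system clock) and, purely as a hypothesis chosen to match the delays quoted in the text, falling edges located 60ns, 180ns, 300ns, ... after Nmemreq is asserted. The actual phase relationship is the one shown in Figure 6.15; the names below are illustrative.

```python
# Assumption: copro clk period, i.e. twice a 60 ns system clock period.
COPRO_CLK_PERIOD_NS = 120
# Assumption: falling edges occur at 60, 180, 300, ... ns after Nmemreq is asserted.
FALLING_EDGE_PHASE_NS = 60


def sampling_delay(assert_time_ns,
                   period_ns=COPRO_CLK_PERIOD_NS,
                   phase_ns=FALLING_EDGE_PHASE_NS):
    """Time from the assertion of Nmemfin to the next falling edge of copro clk.

    Returns 0 when the assertion coincides with a falling edge (setup time ignored).
    """
    return (phase_ns - assert_time_ns) % period_ns


# Dual-port case: Nmemfin is asserted 270 ns after the request.
dual_port_delay = sampling_delay(270)
# Single-port case: 180 ns for Nbr precede the same 270 ns.
single_port_delay = sampling_delay(180 + 270)
print(dual_port_delay, single_port_delay, single_port_delay - dual_port_delay)
```

With these assumed edge positions, the 180ns shift moves the assertion of Nmemfin from just before a falling edge to just after one, which is where the additional 60ns comes from.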
Figure 6.15. Synchronization between memory controller and accelerator
Table 6.2 presents the execution times as a function of iterations, performing 100 internal operations and one memory access per iteration. The difference between the execution times for both mechanisms is constant and equal to the difference between their handshake completions. These results lead to the same conclusion as before, i.e., the busy-wait is better than the interrupt mechanism, independently of the number of coprocessor internal operations and memory accesses. There is one memory access every 37800ns for the busy-wait mechanism, as well as for the interrupt. This time includes the execution of 100 coprocessor internal operations, the coprocessor internal control calculations and the coprocessor memory write, as in (6.11) for the busy-wait and (6.12) for the interrupt mechanism:

$$T_{bw}(\text{iterations}) = 5820 + (37800 \times \text{iterations}) \tag{6.11}$$

$$T_{int}(\text{iterations}) = 16680 + (37800 \times \text{iterations}) \tag{6.12}$$
Table 6.2. Performing 100 internal operations and a single memory access per iteration

iterations  Tbw (ns)   Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)
0           5820       16680      10860       –                  –
1           43620      54480      10860       37800              37800
2           81420      92280      10860       37800              37800
3           119220     130080     10860       37800              37800
8           308220     319080     10860       37800              37800
9           346020     356880     10860       37800              37800
10          383820     394680     10860       37800              37800
20          761820     772680     10860       378000             378000
30          1139820    1150680    10860       378000             378000
80          3029820    3040680    10860       378000             378000
90          3407820    3418680    10860       378000             378000
100         3785820    3796680    10860       378000             378000
200         7565820    7576680    10860       3780000            3780000
300         11345820   11356680   10860       3780000            3780000
The difference between these two equations is constant and corresponds to the difference between the handshake completions of the interface mechanisms. This means that the behavior of the system does not change when the number of coprocessor internal operations changes. Comparing (6.11) and (6.12) to (5.8) and (5.9), we notice that they differ in the coefficient of iterations by the same amount as when performing 10 coprocessor internal operations, discussed at the beginning of this section:

$$T_{bw}(\text{iterations})_{\text{single}} = 5820 + (38280 \times \text{iterations}) \tag{6.13}$$

$$T_{bw}(\text{iterations})_{\text{dual}} = 5820 + (37800 \times \text{iterations}) \tag{6.14}$$

$$T_{bw}(\text{iterations})_{\text{single}} - T_{bw}(\text{iterations})_{\text{dual}} = 480 \times \text{iterations} \tag{6.15}$$

$$T_{int}(\text{iterations})_{\text{single}} = 16680 + (38160 \times \text{iterations}) \tag{6.16}$$

$$T_{int}(\text{iterations})_{\text{dual}} = 16680 + (37800 \times \text{iterations}) \tag{6.17}$$

$$T_{int}(\text{iterations})_{\text{single}} - T_{int}(\text{iterations})_{\text{dual}} = 360 \times \text{iterations} \tag{6.18}$$
6.4.2
Varying the Memory Access Rate
Table 6.3 presents the execution times for busy-wait and interrupt, varying the number of coprocessor internal operations (operations), but performing 10 iterations and 1 memory access per iteration. As discussed in the previous
Table 6.3. Performing 10 iterations and a single memory access per iteration

operations  Tbw (ns)   Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)
0           23820      34680      10860       –                  –
1           27420      38280      10860       3600               3600
2           31020      41880      10860       3600               3600
3           34620      45480      10860       3600               3600
8           52620      63480      10860       3600               3600
9           56220      67080      10860       3600               3600
10          59820      70680      10860       3600               3600
20          95820      106680     10860       36000              36000
30          131820     142680     10860       36000              36000
80          311820     322680     10860       36000              36000
90          347820     358680     10860       36000              36000
100         383820     394680     10860       36000              36000
200         743820     754680     10860       360000             360000
300         1103820    1114680    10860       360000             360000
chapter, when changing the number of internal operations, we also change the coprocessor memory access rate. We see from these results that the difference between the execution times obtained for both interface mechanisms is the same and equal to the difference between their handshake completions. We can conclude that the performance of the system does not depend on the coprocessor memory access rate, but on the interface mechanism used. Since the busy-wait offers the shortest handshake completion, it is then the mechanism providing the shortest execution times.
6.4.3
Varying the Number of Memory Accesses
In order to analyze the interference of the coprocessor memory accesses on the behavior of the execution times for the busy-wait and interrupt mechanisms, we vary the number of memory accesses by changing the parameter accesses. Table 6.4 presents the execution times obtained for both mechanisms, performing 10 iterations and 10 coprocessor internal operations. In spite of changing the number of memory accesses, the results show that the difference between the execution times for both interface mechanisms depends on the difference between their handshake completions only. Therefore, the busy-wait is still the best mechanism to apply. From the above results, we can formulate (6.19) for the busy-wait and (6.20) for the interrupt mechanism:

$$T_{bw}(\text{accesses}) = 50220 + (8400 \times \text{accesses}) \tag{6.19}$$

$$T_{int}(\text{accesses}) = 61080 + (8400 \times \text{accesses}) \tag{6.20}$$
Table 6.4. Performing 10 iterations and 10 internal operations

accesses  Tbw (ns)  Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)
0         50220     61080      10860       –                  –
1         59820     70680      10860       9600               9600
2         68220     79080      10860       8400               8400
3         76620     87480      10860       8400               8400
8         118620    129480     10860       8400               8400
9         127020    137880     10860       8400               8400
10        135420    146280     10860       8400               8400
20        219420    230280     10860       84000              84000
30        303420    314280     10860       84000              84000
80        723420    734280     10860       84000              84000
90        807420    818280     10860       84000              84000
100       891420    902280     10860       84000              84000
Table 6.5 shows the execution times for the busy-wait and interrupt mechanisms as a function of the number of iterations, performing 10 coprocessor internal operations and 2 memory accesses per iteration. We can identify some changes in the difference between the execution times for both mechanisms, but it becomes “steady” when the number of iterations reaches 10. Despite the changes in the number of coprocessor memory accesses, the difference between the execution times for both interface mechanisms presents the same behavior as in the previous cases, depending only on the difference between their handshake completions. Since the busy-wait mechanism provides the shortest handshake completion, it yields shorter execution times. We can formulate (6.21) for Tbw and (6.22) for Tint, in terms of iterations:

$$T_{bw}(\text{iterations}) = 5820 + (6240 \times \text{iterations}) \tag{6.21}$$

$$T_{int}(\text{iterations}) = 16680 + (6240 \times \text{iterations}) \tag{6.22}$$
Comparing (6.21) and (6.22) to (6.3) and (6.4), respectively, we conclude that each additional memory access adds 840ns to the execution times. Recall that, in the original implementation, each memory access adds 1260ns to the busy-wait execution times (5.30) and 1200ns to the interrupt execution times (5.35). Therefore, for each memory access in the dual-port configuration, 420ns are saved when using the busy-wait mechanism and 360ns when using the interrupt mechanism. So, compared to the single-port memory, we can expect a speedup of 1.5 using the busy-wait and of 1.43 using the interrupt.
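The per-access figures in this paragraph follow from simple arithmetic on the coefficients already given. The sketch below only restates that arithmetic; the constant names are illustrative.

```python
# Cost of one extra memory access in the dual-port configuration:
# coefficient of (6.21) minus coefficient of (6.3), in ns.
DUAL_PORT_ACCESS_NS = 6240 - 5400

# Cost of one memory access in the original single-port implementation, in ns,
# as quoted from (5.30) and (5.35).
SINGLE_PORT_ACCESS_NS = {"busy_wait": 1260, "interrupt": 1200}

for mechanism, single_ns in SINGLE_PORT_ACCESS_NS.items():
    saved_ns = single_ns - DUAL_PORT_ACCESS_NS
    speedup = single_ns / DUAL_PORT_ACCESS_NS
    print(f"{mechanism}: {saved_ns} ns saved per access, speedup {speedup:.2f}")
```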
Table 6.5. Performing 10 internal operations and 2 memory accesses per iteration

iterations  Tbw (ns)   Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)
0           5820       16680      10860       –                  –
1           12120      22920      10800       6300               6240
2           18420      29160      10740       6300               6240
3           24720      35400      10680       6300               6240
8           55920      66600      10680       6300               6240
9           62220      72840      10620       6300               6240
10          68220      79080      10860       6000               6240
20          130620     141480     10860       62400              62400
30          193020     203880     10860       62400              62400
80          505020     515880     10860       62400              62400
90          567420     578280     10860       62400              62400
100         629820     640680     10860       62400              62400
200         1253820    1264680    10860       624000             624000
300         1877820    1888680    10860       624000             624000
6.4.4
Speedup Achieved
The speedup S achieved is based on the comparison between the original single-port and the dual-port shared memory configurations. In the previous sections, we obtained some performance results by varying the parameters of the function in the example of Figure 5.1. We will now analyze the performance improvement produced by the dual-port memory implementation. Considering (5.4) and (6.3), for the busy-wait, and (5.5) and (6.4), for the interrupt, which correspond to 10 coprocessor internal operations and 1 memory access per iteration, the speedup achieved using the busy-wait (Sbw) and interrupt (Sint) mechanisms, in terms of iterations, is:

$$S_{bw}(\text{iterations})_{\text{single/dual}} = \frac{T_{bw}(\text{iterations})_{\text{single}}}{T_{bw}(\text{iterations})_{\text{dual}}} \tag{6.23}$$

$$S_{int}(\text{iterations})_{\text{single/dual}} = \frac{T_{int}(\text{iterations})_{\text{single}}}{T_{int}(\text{iterations})_{\text{dual}}} \tag{6.24}$$
The speedup approaches an upper bound, determined in (6.25) for the busy-wait and in (6.26) for the interrupt mechanism:

$$\lim_{\text{iterations} \to \infty} S_{bw}(\text{iterations})_{\text{single/dual}} = 1.09 \tag{6.25}$$

$$\lim_{\text{iterations} \to \infty} S_{int}(\text{iterations})_{\text{single/dual}} = 1.07 \tag{6.26}$$
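Because the execution times are affine in the number of iterations, the upper bound is simply the ratio of the per-iteration coefficients. As a worked check for the busy-wait case, using (6.5) and (6.6) and writing n for iterations (a symbol introduced here only for brevity):

```latex
\lim_{n \to \infty} \frac{5820 + 5880\,n}{5820 + 5400\,n}
  = \lim_{n \to \infty} \frac{5820/n + 5880}{5820/n + 5400}
  = \frac{5880}{5400} \approx 1.09
```

The constant term, which accounts for parameter passing and the handshake completion, therefore has no influence on the bound.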
When performing 10 coprocessor internal operations and 2 memory accesses per iteration, the maximum speedup that can be achieved, considering Equations (5.19), (5.20), (6.21) and (6.22), is given in (6.27) for the busy-wait and in (6.28) for the interrupt mechanism:

$$\lim_{\text{iterations} \to \infty} S_{bw}(\text{iterations})_{\text{single/dual}} = 1.14 \tag{6.27}$$

$$\lim_{\text{iterations} \to \infty} S_{int}(\text{iterations})_{\text{single/dual}} = 1.12 \tag{6.28}$$
The last results indicate that the improvement provided by the dual-port memory grows as the number of coprocessor memory accesses increases. Now, considering Equations (5.7), (5.8), (6.19) and (6.20), which are based on accesses and correspond to the execution of 10 iterations and 10 coprocessor internal operations, we can formulate the speedup Sbw (in (6.29)) and Sint (in (6.30)) in terms of the number of coprocessor memory accesses:

$$S_{bw}(\text{accesses})_{\text{single/dual}} = \frac{T_{bw}(\text{accesses})_{\text{single}}}{T_{bw}(\text{accesses})_{\text{dual}}} \tag{6.29}$$

$$S_{int}(\text{accesses})_{\text{single/dual}} = \frac{T_{int}(\text{accesses})_{\text{single}}}{T_{int}(\text{accesses})_{\text{dual}}} \tag{6.30}$$
As the number of coprocessor memory accesses increases, the speedup approaches an upper bound, determined in (6.31) for the busy-wait and in (6.32) for the interrupt mechanism:

$$\lim_{\text{accesses} \to \infty} S_{bw}(\text{accesses})_{\text{single/dual}} = 1.50 \tag{6.31}$$

$$\lim_{\text{accesses} \to \infty} S_{int}(\text{accesses})_{\text{single/dual}} = 1.43 \tag{6.32}$$
6.5
Summary
This chapter has described a new system, based on a dual-port memory. Our aim was to improve the performance of the system by eliminating bus contention during the coprocessor memory accesses. The new simulation model was validated and proved satisfactory. The speedups obtained show that the execution times can be reduced. On the other hand, the complexity of the hardware and, therefore, the cost of the design have increased. In our search for further performance enhancements, we introduce, in the next chapter, a cache memory for the coprocessor, whilst using the original single-port shared memory configuration. In the following chapter, the co-design system model will be used for a set of simulation-based performance studies. These simulations aim at identifying the relation between the coprocessor memory accesses and the two interface mechanisms. In later chapters, different memory configurations will be introduced and some of the modules defined in this chapter will be modified accordingly.
Notes
1 It is assumed that output signals omitted in a transition keep their previous values.
2 In Figure 6.14, we identify C2:U3 as the coprocessor memory controller VHDL component.
Chapter 7 CACHE MEMORY CONFIGURATION
In the previous chapter, we studied the possibility of obtaining a performance improvement by the use of a dual-port shared memory, instead of the single-port shared memory of the original implementation. The aim was to avoid bus contention and bus arbitration, which are naturally present in any shared bus architecture. The results indicated that whilst an improvement can be achieved, it is not as significant as expected. The dual-port memory configuration provided us with a different interconnection between the processor and the memory, while keeping the same memory hierarchy. The characteristics of the dual-port memory were the same as those of the DRAM used for the single-port. Given this arrangement, the memory characteristics of the co-design system did not change, but the access to it did. A coprocessor cache memory introduces another level in the memory hierarchy, providing a shorter access time, but smaller size, because a cache memory is usually implemented with RAM. Our aim is to reduce further the coprocessor memory access time, whilst avoiding bus contention and bus arbitration. The memory size required by the coprocessor depends on the array sizes, since there is no other data type involved in a coprocessor memory access. As with the dual-port shared memory configuration, we follow the same procedure to evaluate the system's performance, using, once again, the program of Figure 5.1 (see Section 5.1). New simulation results are obtained and compared with the single-port memory implementation, in order to determine whether any further performance improvement has been achieved.
7.1
Memory Hierarchy Design
In most technologies one can obtain smaller memories that are faster than larger memories. The fastest memories are generally available in smaller numbers of bits per chip, but they cost substantially more per byte. We are always in search of maximum performance at a minimum cost. The type of memory to use in a design will depend on the trade-off between the performance required and the cost of the implementation. The principle of locality of reference (Hennessy and Patterson, 1990a; Patterson and Hennessy, 1994) states that the data most recently used are likely to be accessed again in the near future. Because smaller memories will be faster, we want to use smaller memories to hold the most recently accessed items. On the other hand, the cost of this implementation increases, since these memories are expensive. The coprocessor uses the memory to access data arrays only. The characteristic of this data type allows us to assume that the principle of locality applies. This yields the idea of providing the coprocessor with a small but fast memory, thus avoiding the use of the large and slow main shared memory. With the use of a coprocessor local memory, we introduce a new memory hierarchy in the design.
7.1.1
General Principles
A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The upper level, closer to the processor, is smaller and faster than the lower level. The minimum unit of information that can be transferred between the two levels of the hierarchy is called a block. The memory address is then divided into a block-frame address, corresponding to the higher-order part of the address and identifying the block at that level of the hierarchy, and the block-offset address, corresponding to the lower-order part of the address and identifying an item within a block. Figure 7.1 shows the address for the upper level and the address for the lower level. A memory access found in the upper level is a hit, while a miss means it is not found at that level. The fraction of memory accesses found in the upper level is the hit rate and is usually represented by a percentage. The miss rate is the fraction of memory accesses not found in the upper level and corresponds to (1.0 − hit rate). These measures are important in determining the performance of the system. The time to access the upper level of the memory hierarchy is called the hit time, which includes the time to determine whether the access is a hit or a miss. The time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the requesting device, is called the miss penalty.
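The definitions above can be made concrete with a short sketch. The address split assumes a block size that is a power of two. The average access time formula (hit time plus miss rate times miss penalty) is the usual textbook combination of these measures and is not stated in this section; all function names and numeric values are illustrative.

```python
def split_address(address, block_size):
    """Split an address into (block-frame address, block-offset address).

    Assumes block_size is a power of two, so the offset is the low-order bits.
    """
    offset_bits = block_size.bit_length() - 1
    return address >> offset_bits, address & (block_size - 1)


def miss_rate(hit_rate):
    """Fraction of accesses not found in the upper level."""
    return 1.0 - hit_rate


def average_access_time(hit_time_ns, hit_rate, miss_penalty_ns):
    """Textbook average access time: hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate(hit_rate) * miss_penalty_ns


# Example with a 16-byte block: address 0x1234 lies at offset 4 of block frame 0x123.
print(split_address(0x1234, 16))
# Hypothetical figures: 30 ns hit time, 75% hit rate, 300 ns miss penalty.
print(average_access_time(30, 0.75, 300))
```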
Figure 7.1. Addressing different memory levels
7.1.2
Cache Memory
A cache (Hennessy and Patterson, 1990a; Hennessy and Patterson, 1990b; Hwang and Briggs, 1985) represents the level of the memory hierarchy between the processor and the main memory. In our case, the cache is implemented between the coprocessor and the main memory. With this, we try to reduce the coprocessor shared memory accesses, thus avoiding bus contention, and provide the coprocessor with a faster memory access time than that of the shared memory. There are, basically, three categories of cache organization, based on the block placement policy adopted, as shown in Figure 7.2: direct mapped, fully associative and set associative.

Direct Mapping. This is the simplest of all organizations, in which each block has only one place in the cache. The main memory address consists of three fields: tag, block and word. Each block frame has its own specific tag associated with it. When a block of the memory exists in a block frame in the cache, the tag associated with that frame contains the high-order bits of the main memory address of that block. When a physical memory address is generated for a memory reference, the block address field is used to address the corresponding block frame, with the mapping obtained as (block-frame address) modulo (number of blocks in the cache). The tag address field is compared with the tag in the
Figure 7.2. Block placement policies, with 8 cache blocks and 32 main memory blocks
cache block frame. If there is a match, the data in the block frame is accessed by using the word address field. This scheme has the advantage of permitting simultaneous access to the desired data and tag. If there is no tag match, the output data is discarded. No associative comparison is needed and, hence, the cost is reduced. The direct mapped cache also has the advantage of a trivial replacement algorithm, avoiding the overhead of record keeping associated with the replacement rule. Of all the blocks that map into a block frame, only one can actually be in the cache at a time. Hence, if a block causes a miss, we simply determine the block frame this block maps onto and replace the block in that block frame, which occurs even when the cache is not full. A disadvantage of direct mapping is that the cache hit ratio drops sharply if two or more blocks, used alternately, happen to map onto the same block frame in the cache. The possibility of this contention may be small if such blocks are relatively far apart in the processor address space.

Fully Associative. In terms of performance, this is the best and most expensive cache organization. The mapping is such that any block in memory can be in any block frame. The main address consists of two fields: tag and word. When a request for a block is presented to the cache, all the map entries are compared simultaneously (associatively) with the request to determine if the request is present in the cache. Although the fully associative cache eliminates the high block contention, it incurs a longer access time because of the associative search.
Set Associative. This represents a compromise between direct and associative mapping organizations. In this scheme, the cache is divided into S sets, each containing a number of block frames determined by the total number of block frames in the cache divided by the number of sets. A block can be placed in a restricted set of places in the cache. The main address field consists of: tag, set and word. The simplest and most common scheme used for mapping a physical address into a set number is the bit-selection algorithm. In this case, the number of sets S is a power of 2 (say 2^k). If there are 2^j words per block, the first j bits select the word within a block, and bits j to j + k − 1 select the set via a decoder. Hence, the set field of the memory address defines the set of the cache, which may contain the desired block, as in the direct mapping scheme. The tag field of the memory address is then associatively compared to the tags in the set. If a match occurs, the block is present. The cost of the associative search, in a fully associative cache, depends on the number of tags (blocks) to be simultaneously searched and the tag field length. The set associative cache attempts to cut this cost down and yet provide a performance close to that of the fully associative cache.

In order to decide the most appropriate cache organization, we consider the characteristics of our application. The coprocessor memory accesses are restricted to data arrays only. In the example considered (see Figure 5.1 in Section 5.1), the data array is used inside a loop, with the possibility of yielding a sequence of accesses to the same location as well as to consecutive locations. Therefore, the probability of having most of the accesses to the same block is high and the probability of having blocks mapping to the same cache frame is low.
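The bit-selection algorithm described above amounts to slicing the address into three bit fields. The following is a minimal sketch, assuming word addresses and taking "the first j bits" to be the j lowest-order bits; the function name bit_select is illustrative.

```python
def bit_select(address, j, k):
    """Return (tag, set, word) for a cache with 2**j words per block and 2**k sets."""
    word = address & ((1 << j) - 1)              # bits 0 .. j-1 select the word
    set_index = (address >> j) & ((1 << k) - 1)  # bits j .. j+k-1 select the set
    tag = address >> (j + k)                     # remaining high-order bits form the tag
    return tag, set_index, word


# Example: 4 words per block (j = 2) and 8 sets (k = 3).
tag, set_index, word = bit_select(0b101101101101, j=2, k=3)
print(tag, set_index, word)

# With one block frame per set, the set field is the direct-mapping index:
# (block-frame address) modulo (number of blocks in the cache).
assert set_index == (0b101101101101 >> 2) % (1 << 3)
```

A direct mapped cache can thus be seen as the special case in which each set holds a single block frame, so no associative comparison within the set is needed.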
Due to these characteristics, we choose the direct mapping scheme as our block placement policy, for its simplicity, requiring less logic than the other schemes, and its low-cost implementation. There are two options when writing to the cache, which are called write policies:

write through: data is written to both the block in the cache and the block in the lower-level memory;
write back: data is written only to the block in the cache; the modified cache block is written back to the lower-level memory when it is replaced.

Write-back cache blocks are called dirty if the information in the cache differs from that in the main memory. In order to reduce the frequency of writing back blocks on replacement, a dirty bit is used. If this flag is not set, it indicates that the block was not modified while in the cache and so does not need to be written to the main memory. Due to the characteristics of our application, described above for the choice of organization, we opt for the write back policy, since writes occur at the speed of the cache memory and multiple writes within a
Figure 7.3. Coprocessor board for the cache memory configuration
block require only one write to the main memory. Hence, write back uses less memory bandwidth.
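To illustrate why write back saves memory bandwidth, the sketch below models a direct mapped cache with one dirty bit per block frame and counts how many blocks are written back to main memory. It is a behavioural illustration only, not the VHDL model used in this book; it assumes byte addresses and that a write miss first brings the block into the cache (write allocate), which is an assumption of this sketch.

```python
class DirectMappedWriteBackCache:
    """Behavioural model: tags and dirty bits only, no data storage."""

    def __init__(self, num_frames, block_size):
        self.num_frames = num_frames
        self.block_size = block_size
        self.tags = [None] * num_frames    # None marks an empty block frame
        self.dirty = [False] * num_frames
        self.misses = 0
        self.write_backs = 0

    def access(self, address, is_write):
        block = address // self.block_size
        frame = block % self.num_frames    # direct mapping: block modulo frames
        tag = block // self.num_frames
        if self.tags[frame] != tag:        # miss: the resident block is replaced
            self.misses += 1
            if self.dirty[frame]:          # only modified blocks are written back
                self.write_backs += 1
            self.tags[frame] = tag
            self.dirty[frame] = False
        if is_write:
            self.dirty[frame] = True       # the write stays in the cache for now


cache = DirectMappedWriteBackCache(num_frames=8, block_size=16)
for address in range(0, 32, 2):            # 16 word writes spanning two blocks
    cache.access(address, is_write=True)
cache.access(8 * 16, is_write=False)       # maps onto frame 0 and evicts a dirty block
print(cache.misses, cache.write_backs)
```

In this run, sixteen writes cause only one block to be written to main memory, and only at replacement time; under write through each of the sixteen writes would also go to main memory.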
7.2
System Organization
The dual-port memory configuration, discussed in the previous chapter, required a new main system configuration due to the substitution of the single-port memory by a dual-port memory. In the present implementation, we return to the main system configuration presented in Chapter 4. Basically, only the coprocessor board is modified to include a cache memory, as shown in Figure 7.3. The coprocessor bus interface controller is modified to include the task of dealing with the block transfer between the cache and the main memory, and a coprocessor memory controller is included, in the same manner as the one discussed in Chapter 6, to deal with the coprocessor accesses to the cache.
7.2.1
Cache Memory Model
We are initially considering a cache size of 8KB. The cache memory SRAM16 consists of two 4K × 8 bits SRAMs, as described in Figure 7.4, based on commercially available SRAMs. To enable byte access, each SRAM must be selected individually: the byte from an even address is related to the most significant byte of the system data bus (data15:8) and the byte from an odd address is related to the least significant byte of the system data bus (data7:0). Signals Nce0 and Nce1 correspond to the even and odd chip enables, respectively.
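The byte-lane arrangement described above can be sketched as two byte-wide banks sharing a word address. This is only a behavioural illustration in Python, not the VHDL model of SRAM16; the class and method names are hypothetical, chip enables are active low, and a deselected bank is modelled as contributing zero to the data bus, whereas real hardware would leave those lines undriven.

```python
class Sram16Model:
    """Two 4K x 8 banks: even addresses on data15:8, odd addresses on data7:0."""

    def __init__(self, words=4096):
        self.even_bank = bytearray(words)   # selected by Nce0 (active low)
        self.odd_bank = bytearray(words)    # selected by Nce1 (active low)

    def write(self, addr, data16, nce0, nce1):
        word = addr >> 1                    # addr0 is not part of the word address
        if nce0 == 0:
            self.even_bank[word] = (data16 >> 8) & 0xFF
        if nce1 == 0:
            self.odd_bank[word] = data16 & 0xFF

    def read(self, addr, nce0, nce1):
        word = addr >> 1
        high = self.even_bank[word] if nce0 == 0 else 0   # modelling assumption
        low = self.odd_bank[word] if nce1 == 0 else 0     # modelling assumption
        return (high << 8) | low


sram = Sram16Model()
sram.write(0x10, 0xABCD, nce0=0, nce1=0)    # word write to an even address
sram.write(0x11, 0x00EF, nce0=1, nce1=0)    # byte write to the odd address
print(hex(sram.read(0x10, nce0=0, nce1=0)))
```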
Figure 7.4. Logic symbol of the cache memory model
During a read cycle, the address must be valid for a minimum of 30ns (read cycle time). Data-out is valid in 30ns, at most, from the moment the address becomes valid (address access time) or from the moment the chip enable signal is asserted (chip enable access time). Data-out is valid for a minimum of 5ns after the chip enable signal is negated or the address changes. During a write cycle, the internal write time of the memory is defined by the overlap of Nce and Nwe when they are active. Both signals must be low to initiate a write and either signal can terminate a write by going high. The address must be valid for 30ns at least (write cycle time). The write enable Nwe must be active for 25ns, at least. Data in must be valid at least 15ns before Nwe or Nce is negated, and must be held for a minimum of 5ns after the negation of one of these signals. The chip enable Nce must be active for at least 25ns. Figure 7.5 shows a memory read/write operation. The operation starts once the chip enable signal Nce is asserted. Then, the kind of operation is determined by the write enable signal Nwe: if it is a write operation (Nwe = 0), the content of the memory addressed by addr receives the content of the data bus; if it is a read (Nwe = 1), the data bus receives the content of the memory addressed by addr.
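The write-cycle constraints listed above lend themselves to a simple checker. The sketch below collects the minimum times quoted in the text in a table and reports which ones a given set of measured times violates; the dictionary keys and the function name are illustrative and do not correspond to signals or procedures of the simulation model.

```python
# Minimum times for a write cycle, in ns, as quoted in the text above.
WRITE_CYCLE_MINIMUMS_NS = {
    "address valid": 30,   # write cycle time
    "Nwe active": 25,      # write enable pulse width
    "Nce active": 25,      # chip enable pulse width
    "data setup": 15,      # data valid before Nwe or Nce is negated
    "data hold": 5,        # data held after Nwe or Nce is negated
}


def write_cycle_violations(measured_ns):
    """Return the names of the constraints that the measured times do not meet."""
    return [name
            for name, minimum in WRITE_CYCLE_MINIMUMS_NS.items()
            if measured_ns[name] < minimum]


# A cycle that just meets every minimum, and one with two hypothetical violations.
nominal = dict(WRITE_CYCLE_MINIMUMS_NS)
too_fast = {**nominal, "Nwe active": 20, "data setup": 10}
print(write_cycle_violations(nominal))
print(write_cycle_violations(too_fast))
```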
7.2.2
The Coprocessor
Figure 7.6 shows the logic symbol for the coprocessor. Compared to the original implementation, only the signal fc was removed and signal Nregfin included. Signal fc corresponds to the function codes required by the MC68332 bus. Since the coprocessor does not access the main memory, there is no need to control this signal. On the other hand, fc will be used by the bus interface controller, during a block transfer between the coprocessor cache and the main memory. When Nregfin is asserted, it indicates the completion of one parameter passing between the coprocessor and the microcontroller. The bus interface controller uses this signal to assert the data strobe acknowledge Ndsack(1), indicating the completion of the data transfer, following the protocol in the MC68332 asynchronous bus operation.
Figure 7.5. Flowchart for the cache memory model
Figure 7.6. Logic symbol of the coprocessor for the cache configuration
As we can see from Figure 7.3, the coprocessor address and data buses are connected to the internal bus. During parameter passing, these buses are connected to the main system bus by the bus interface controller.
7.2.3
Coprocessor Memory Controller
Similar to the one introduced in the previous chapter, this component controls the coprocessor accesses to the cache memory; its logic symbol is shown in Figure 7.7. The coprocessor generates a 24-bit address (x addr), as if it were addressing the main memory¹. Assuming the initial cache size of 4K × 16 bits, proposed in Section 7.2.1, the cache address is 12 bits wide. Nevertheless,
Figure 7.7. Logic symbol for the coprocessor memory controller
Figure 7.8. Virtual address for the cache memory configuration
since we allow for byte access, the cache can be seen as of 8K × 8 bits, which requires a 13-bit address (addr12 : 0). In a word access, odd addresses are not allowed, and in a byte access, even and odd addresses are. We follow the same procedure as the MC68332 bus operation: since the data bus is of 16 bits, a byte access to an even address puts the datum in the higher-order byte of the data bus, and a byte access to an odd address puts the datum in the lower-order byte of the data bus. Thus, we do not need addr0 for a memory access, since we always access a word. On the other hand, we use addr0 to select the low and high byte part of the cache through signals Nce0 and Nce1, respectively. So far, when sending the address to the cache, the memory controller is really sending the 12 higher-order bits of addr. Considering the concepts introduced in Section 7.1, the cache is organized into blocks and a cache access is limited to the access of a specific block. We use direct mapping for block placement, following the address structure of Figure 7.8. Assuming a cache size of CS bytes and a block size of BS bytes, there will be (CS/BS) blocks in the cache, each block requiring an address of (log2 (CS/BS)) bits. From Figure 7.8, we can state that:
Figure 7.9. Cache directory
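$$
\begin{aligned}
\text{byte} &= \log_2 BS \ \text{bits} \\
\text{block} &= \log_2 (CS/BS) \ \text{bits} \\
\text{tag} &= 24 - (\text{block} + \text{byte}) \ \text{bits}
\end{aligned}
\tag{7.1}
$$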
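As an illustration of (7.1), the field widths and the split of a 24-bit address can be sketched as follows. The function names field_widths and split_address are hypothetical, and the sketch assumes that CS and BS are powers of two given in bytes.

```python
ADDRESS_BITS = 24  # width of the address generated by the coprocessor


def field_widths(cache_bytes, block_bytes):
    """Return (tag, block, byte) widths in bits, following (7.1)."""
    byte_bits = block_bytes.bit_length() - 1                     # log2(BS)
    block_bits = (cache_bytes // block_bytes).bit_length() - 1   # log2(CS/BS)
    tag_bits = ADDRESS_BITS - (block_bits + byte_bits)
    return tag_bits, block_bits, byte_bits


def split_address(addr, cache_bytes, block_bytes):
    """Split a 24-bit address into its (tag, block, byte) fields."""
    _, block_bits, byte_bits = field_widths(cache_bytes, block_bytes)
    byte = addr & ((1 << byte_bits) - 1)
    block = (addr >> byte_bits) & ((1 << block_bits) - 1)
    tag = addr >> (byte_bits + block_bits)
    return tag, block, byte
```

For the 8K × 8 bits cache with 512-byte blocks, field_widths(8192, 512) gives 11 tag bits, 4 block bits and 9 byte bits; the block and byte fields together form the 13-bit cache address.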
The address to the 8K × 8 bits cache is obtained from the concatenation of block and byte as addr12 : 0 = block & byte. In order to address the block, we have first to check if the block is present in the cache. This operation is done using a cache directory (Hwang and Briggs, 1985), shown in Figure 7.9, which contains the tag field, plus the dirty bit introduced in Section 7.1.2. The coprocessor memory controller, shown in Figure 7.10, addresses the cache directory using the block field from the virtual address provided by the coprocessor. The content of the cache directory is then compared with the tag field. If there is a hit, the block is present in the cache and the word is accessed. Otherwise, the block is not present and it has to be brought from the main memory by the bus interface controller. The controller remains in the initial state idle until a memory request takes place (Nmemreq = 0) or the coprocessor finishes the execution of the function (Ncopro dn = 0). In the first case, the controller starts a cache directory access to see if the required block is in the cache: if the block is not present, the
Figure 7.10. Algorithmic state machine for the coprocessor memory controller
controller enters state hit wait; otherwise, it enters state addr1. In the second case, the controller starts a cache directory access to see if any block has been modified in the cache and needs then to be updated in the main memory, in which case it enters state test. Notice that since idle is the initial state, some signals are initialized here. In state hit wait, the controller waits until the block is brought from the main memory by the bus interface controller. When this operation is accomplished, the bus interface controller asserts hit back. The memory controller then updates the cache directory and enters state addr1, to start the access to the word. In state addr1, the controller identifies the end of the memory access. If it is a byte or word access, the operation finishes and the controller enters state complete. If it is a long word access, another memory access is required, the address is incremented by 2 and the controller enters state addr2. In both cases, if it is a write operation (Nfpga wr = 0), the block dirty bit is set in the cache directory. State addr2 is used during a long word access. The cache memory chip enable signal Nce is negated, as soon as the controller enters this state, indicating the end of the first word access. This is required especially during a write, since we use Nce to control the end of this operation. During a read, the coprocessor buffers latch the input data. State data2 is used during a long word access, following addr2. In this state, the cache memory chip enable Nce is asserted and the coprocessor buffers receive the lower word of a long word access. State complete is the next state after a byte/word access, accomplished in state addr1, or a long word access, accomplished in state data2. In this state, signal Nmemfin is asserted, indicating the end of the memory access to the coprocessor, Nce is negated and the coprocessor buffers are closed. 
The controller enters state test from the initial state, once the coprocessor finishes executing the function. The cache directory is checked for blocks that were modified by the coprocessor (dirty = 1) and must be updated in the main memory, before the end of coprocessor operation is propagated. Signal dirty receives the value of the dirty bit, so that the bus interface controller knows that a block transfer from the cache to the main memory is required. Once a block transfer starts, the controller enters state wait. Notice that the cache directory is accessible only by the cache memory controller. Since the bus interface controller needs the tag field and the block address to locate the required block in the main memory, this information is passed by the cache memory controller through the interface signals tag and sblock. On the other hand, if the block was not modified (dirty = 0), the block address is incremented and the controller enters state control.
System Organization
145
The controller remains in state wait until the bus interface controller finishes the block transfer, which is indicated by the assertion of the signal dirty back. The cache directory is then updated, the block address is incremented and the controller enters state control. In state control, if there are more blocks to be checked, the controller returns to state test. Otherwise, it propagates signal Ncopro dn, through signal Ncopro dn del, and waits until Ncopro dn is negated by the coprocessor, as a result of the handshake completion. The VHDL model of the coprocessor memory controller consists of a process, with a sensitivity list containing the main system clock (clk) and the system reset (Nreset). Once the reset signal is asserted at the beginning of the simulation, signals are initialized and the address bus is set to the high-impedance state. The controller transitions from one state to another on the rising edge of the clock.
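The directory behaviour described in this section (tag comparison, setting of the dirty bit on a write, and the final scan for dirty blocks) can be sketched behaviourally as below. This is not the VHDL model: the class CacheDirectory and its method names are hypothetical, and marking an empty entry with None is an assumption, since the text does not describe how unused entries are represented.

```python
class CacheDirectory:
    """Direct-mapped directory holding one (tag, dirty) entry per block."""

    def __init__(self, num_blocks):
        self.tags = [None] * num_blocks   # None marks an empty entry (assumption)
        self.dirty = [False] * num_blocks

    def access(self, tag, block, is_write):
        """Return (hit, write_back_tag) for one coprocessor access.

        write_back_tag is the tag of a dirty block that must be copied to
        the main memory before replacement, or None if none is needed.
        """
        hit = self.tags[block] == tag
        write_back_tag = None
        if not hit:
            if self.dirty[block]:
                write_back_tag = self.tags[block]
            self.tags[block] = tag        # directory updated after the fetch
            self.dirty[block] = False
        if is_write:
            self.dirty[block] = True      # a write marks the block as dirty
        return hit, write_back_tag

    def dirty_blocks(self):
        """Scan all entries in block order, as done at the end of the function."""
        return [(b, self.tags[b]) for b in range(len(self.tags)) if self.dirty[b]]
```

The pairs returned by dirty_blocks correspond to the block address and tag that the memory controller passes to the bus interface controller through sblock and tag.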
7.2.4
The Bus Interface Controller
The bus interface controller, shown in Figure 7.11, is concerned with parameter passing, access to the control register and data transfer between the main memory and coprocessor cache memory, during a cache miss and a main memory update operation. In this sense, it has complete access to the main address and data buses, as well as to the internal address and data buses. The function codes signal fc is now included, due to the data transfers through the main system bus. Signals x size1 : 0, Naen, Nmemfin and Nmemreq are not required since they were related to the coprocessor main memory access, which is not taking place anymore. Signal Nregfin was included to control parameter passing between the microcontroller and coprocessor. Once Nregfin goes low, the bus interface controller asserts the data strobe acknowledge Ndsack(1), indicating to the microcontroller the end of parameter passing, as it is required in an asynchronous bus transfer. Four state machines are implemented: one for parameter passing control, one for interrupt control, one for block transfer control when there is a cache miss, and one for block updating control when the coprocessor finishes executing the function and there are blocks modified in the cache. The first two state machines are the same as in the original implementation. Figure 7.12 shows the algorithmic state machine for the block transfer control. The controller remains in state idle until a cache miss is detected (hit = 0), which is sent by the coprocessor memory controller. A block transfer from the main memory to the cache is then required. The controller remains in state req until the main bus is granted. Once the bus is granted, the block address is sent to the main address bus, as well as to the cache address bus. If the dirty bit of the block already in the cache is 1, it means that this block was modified by the coprocessor. Before block replacement, the dirty block must then be copied
Figure 7.11. Logic symbol of the bus interface controller for the cache configuration
back into the main memory to preserve consistency. Therefore, the tag field from the cache directory and the block field from the address provided by the coprocessor are sent to the main address bus. The block field is also sent to the cache address bus and the controller enters state write. On the other hand, if the dirty bit is 0, the block transfer from the main memory to the cache can proceed. In this case, instead of the tag field from the cache directory, the main address bus receives the tag and block fields from the address provided by the coprocessor. The controller then enters state read. The controller remains in state write until a word of the block being transferred from the cache is written into the main memory. When signal havedsack is asserted by the main memory controller, the state machine enters state addr write. In state addr write, the controller checks if the whole block has been written into the main memory. In this case, the block transfer from the cache to the main memory finishes and the block transfer from the main memory to the cache must start. The main address bus receives the tag and block fields from the address provided by the coprocessor, and the cache address bus receives the
Figure 7.12. Algorithmic state machine for the block transfer
block field only. The controller then enters state read. On the other hand, if the block transfer has not finished, the addresses to the main memory and to the cache memory must be updated. The controller then returns to state write, in order to wait for another word to be written into the main memory. Once in state read, the controller waits for the word of the block being brought from the main memory to the cache to be read. As soon as signal havedsack is asserted by the main memory controller, the bus interface identifies the end of the memory access and enters state addr read. In state addr read, the controller checks if the whole block has been brought from the main memory to the cache. In this case, it asserts signal hit back, indicating to the coprocessor memory controller that the block transfer has finished, releasing the main bus, as well as the cache bus, and entering state complete. On the other hand, if the block transfer is not complete, the addresses to the main memory and to the cache memory must be updated, and the controller returns to state read. State complete is used to disable some control signals, while keeping signal hit back asserted. This gives some time for the coprocessor memory controller to identify the end of the block transfer, before hit back is negated in the next state. The controller then enters state idle. The VHDL model of the block transfer controller is implemented as a process, with sensitivity list containing the main system clock (clk) and the system reset (Nreset). Once the reset signal is asserted, at the beginning of the simulation, signals are initialized, while the main address bus, the main data bus, the cache address bus and the coprocessor data bus are set to high impedance state. The controller transitions from one state to another on the rising edge of the clock. Figure 7.13 shows the algorithmic state machine used to control the block updating procedure. 
The controller remains in state idle until the coprocessor finishes executing the function (Ncopro dn = 0) and there are modified blocks in the cache (dirty = 1). A block transfer from the cache to the main memory is then required and the state machine enters state req. Notice that the bus interface controller has a direct access to the coprocessor cache memory. In this sense, it has control over the cache control signals, in the same manner as the coprocessor memory controller. While the bus interface controller is in state idle, it is not requiring any memory access and must keep the cache control signals (Nwe, Nce) disabled, i.e., in the high impedance state. The controller remains in state req until the bus is granted (Ngrant = 0). It then sends the tag and block fields, provided by the coprocessor memory controller, to the main address bus, and the same block field to the cache address bus. The controller then enters state write. In state write, the controller waits until the word of the block being brought from the cache is written into the main memory. When signal havedsack is
Figure 7.13. Algorithmic state machine for the block updating
asserted by the main memory controller, the memory write has finished and the controller enters state addr write. Notice that, at this moment, the bus interface controller has full control over the cache control signals (Nwe, Nce). There is no need for arbitration, since when the bus interface controller is accessing the cache, the coprocessor memory controller is waiting for the operation to complete, keeping the cache control signals disabled. In state addr write, the controller checks if the whole block has been brought from the cache. In this case, the main and the cache buses are released, and signal Ndirty is asserted. The controller then enters state complete. Otherwise, the main and the cache addresses are updated and the controller returns to state write. In state complete, the controller negates the bus request (Nrequest = 1). This state is used to keep signal Ndirty asserted, so that the coprocessor memory controller can identify the end of the block transfer and negate signal dirty. As soon as the controller returns to state idle, signal Ndirty is negated. The VHDL model of the block updating controller is implemented as a process with a sensitivity list containing the main system clock (clk) and the system reset (Nreset). Once the reset signal is asserted, at the beginning of the simulation, signals are initialized and the main address bus, main data bus, cache address bus and coprocessor data bus are set to the high-impedance state. The controller changes from one state to another on the rising edge of the clock.
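The address sequences generated during a block replacement can be sketched as follows. The helper names block_words and replacement_plan are hypothetical; the sketch assumes the direct-mapped address layout of Figure 7.8, with the main memory address formed from tag, block and byte offset and the cache address from block and byte offset only. The labels "write" and "read" mirror the states of the block transfer controller.

```python
def block_words(tag, block, block_bits, byte_bits):
    """Yield (main_address, cache_address) for each 16-bit word of a block."""
    cache_base = block << byte_bits
    main_base = (tag << (block_bits + byte_bits)) | cache_base
    for offset in range(0, 1 << byte_bits, 2):   # one word is 2 bytes
        yield main_base + offset, cache_base + offset


def replacement_plan(new_tag, old_tag, old_dirty, block, block_bits=4, byte_bits=9):
    """List the word transfers for one cache miss.

    A dirty block is first written back using the tag stored in the cache
    directory; the missing block is then read using the tag of the address
    provided by the coprocessor.
    """
    plan = []
    if old_dirty:
        plan += [("write", m, c)
                 for m, c in block_words(old_tag, block, block_bits, byte_bits)]
    plan += [("read", m, c)
             for m, c in block_words(new_tag, block, block_bits, byte_bits)]
    return plan
```

With the default widths (16 blocks of 512 bytes), a miss without write back produces 256 word transfers, and a miss with write back produces 512.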
7.3
Timing Characteristics
The new VHDL model was validated using the LEAPFROG simulator and the CWAVES wave viewer (Systems, 1995). The example used in Chapters 5 and 6 (see Figure 5.1) to validate the model and to analyze the behavior of the system in detail is used here as well. We start with 10 iterations, 1 coprocessor internal operation per iteration and 1 coprocessor memory access per iteration. The read and write operations to the shared memory, as well as to the cache memory, were successfully performed. The times related to the microcontroller accesses to the control register and to the main memory are the same as in the original implementation. Parameter passing is still performed through memory-mapped registers. Figure 7.14 shows a coprocessor cache memory write, starting as soon as Nmemreq is asserted by the coprocessor². The cache address bus is configured with a valid address one clock cycle after Nmemreq is asserted. The cache memory³ chip enable signal Nce indicates the moment when the write operation starts, controlling its duration. Data must be available before and after the rising edge of Nce, according to the description given in Section 7.2.1. The write enable signal Nwe indicates the type of the operation. We know that the
Figure 7.14. Coprocessor cache memory write
coprocessor cache memory access finishes between 0.5 and 1.5 clock cycles after Nmemfin is asserted⁴, when its internal signal Mem fin is then asserted. We can then summarize the coprocessor cache memory accesses as:
coprocessor cache memory read (16 bits): 180ns to 240ns (3 to 4 clock cycles)
coprocessor cache memory write (16 bits): 180ns to 240ns (3 to 4 clock cycles)
Figure 7.15 shows the beginning of a block transfer from the main memory to the cache when there is a cache miss (hit = 0), indicated by the coprocessor memory controller (see Figure 7.10, in Section 7.2.3). As soon as the bus interface controller identifies the cache miss, it requires the main bus as the initial procedure for the block transfer (see Figure 7.12 of Section 7.2.4). Signal hit is negated 1 clock cycle after the memory request takes place, when a valid address to the cache memory should be sent instead. The bus is then requested 2 clock cycles later, starting the bus arbitration process, which
Figure 7.15. Bus arbitration when there is a cache miss
follows the same procedure as described in Section 5.2.2, Section 5.2.3 and Section 5.2.4 for the original implementation. Defining the bus arbitration cycle as BA, the times corresponding to each interface mechanism are:
bus arbitration cycle with busy-wait (BAbw): 180ns to 360ns (3 to 6 clock cycles)
bus arbitration cycle with interrupt (BAint): 180ns (3 clock cycles)
Figure 7.16 shows a word of a block being transferred from the cache to the main memory. Besides the times corresponding to read/write operations either from the microcontroller or the coprocessor, we now have block transfers between the main memory and the coprocessor cache memory⁵. Considering the moments when the address lines change, the times obtained are as below:
Figure 7.16. Transferring a word from the cache to the main memory
transferring a word from the main memory to the cache: 240ns (4 clock cycles)
transferring a word from the cache to the main memory: 240ns (4 clock cycles)
Since the block size is BS bytes, there are BS/2 words in the block. Therefore, we define the block write time BW, in clock cycles, as:
$$BW = 4 \times (BS/2) \tag{7.2}$$
$$BW = 2 \times BS \tag{7.3}$$
Figure 7.17 shows the end of a block transfer for a cache miss, when the main bus is released and Nbgack is negated one clock cycle later. Considering
the delay of one clock cycle between the assertion of Nbgack and the access to the address bus, at the beginning of the block transfer, and the block write time BW above, the block transfer time BTR, during a cache miss, can be defined as in (7.4).
$$BTR = 2 + BW \tag{7.4}$$
The coprocessor cache memory access can then proceed, when the cache address bus is configured with a valid address. Notice that, if there was no cache miss, the cache address bus would have been accessed within one clock cycle from the moment Nmemreq was asserted (see Figure 7.14). Signal Nmemfin is asserted 1.5 clock cycles after the cache memory access started, indicating the end of the operation. The coprocessor identifies the end of the memory access when signal Mem fin is asserted, between 0.5 and 1.5 clock cycles after Nmemfin (see Section 5.3.3). When a cache miss occurs, besides bringing the missing block from the main memory, the replaced block in the cache may have to be written back into the main memory if this block was modified in the cache. We use the write back policy, introduced in Section 7.1.2, in which the modified cache block is written back to the main memory only when it is replaced. Instead of bringing the missing block from the main memory after signal hit is asserted, the cache block being replaced is written back, taking the time corresponding to a block write (BW). As soon as this transfer finishes, the missing block is then transferred to the cache. Signal hit back is asserted only by the end of this last transfer. Notice that, in spite of having two block transfers, only one bus arbitration is required. The increase in the block replacement time is due to the second block transfer time. Two possible elapsed times for the block replacement can now be obtained: one when there is no need for write back and another when write back is needed. A block replacement starts with the assertion of signal hit.
Considering, in Figure 7.15, Figure 7.16 and Figure 7.17, the delay between hit and Nbr (2 clock cycles), plus the bus arbitration cycle (which depends on the interface mechanism), plus the delay between Nbgack and the main address bus access (1 clock cycle), plus the time to write the whole block into the main memory (BW), if write back is required, and/or the time to write the whole block into the cache (BW), plus the delay between the end of the block transfer and the negation of Nbgack (1 clock cycle), we define the block replacement time BR as:
$$BR = 2 + BA + 1 + (BW \text{ or } 2 \times BW) + 1 \tag{7.5}$$
$$BR = 4 + BA + (BW \text{ or } 2 \times BW) \tag{7.6}$$
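Equations (7.3), (7.4) and (7.6) can be evaluated with a few lines of code. The sketch below works in clock cycles; the function names are hypothetical, and the 60ns clock period is inferred from the figures quoted in this section (180ns for 3 clock cycles).

```python
CLOCK_NS = 60  # clock period inferred from "180ns (3 clock cycles)"


def block_write_cycles(bs):
    """BW of (7.3): clock cycles to transfer a block of bs bytes."""
    return 2 * bs


def block_transfer_cycles(bs):
    """BTR of (7.4): block write time plus 2 clock cycles of bus delay."""
    return 2 + block_write_cycles(bs)


def block_replacement_cycles(bs, ba, write_back):
    """BR of (7.6): one transfer without write back, two with it."""
    transfers = 2 if write_back else 1
    return 4 + ba + transfers * block_write_cycles(bs)
```

For BS = 512 bytes and a bus arbitration cycle of 3 clock cycles, a replacement without write back takes 1031 clock cycles (61860ns), while one with write back and a 6-cycle arbitration takes 2058 clock cycles.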
Figure 7.17. End of the block transfer during a cache miss
Since the bus arbitration cycle can vary between 3 and 6 clock cycles, we obtain the following values for BR:
$$7 + (BW \text{ or } 2 \times BW) \leq BR \leq 10 + (BW \text{ or } 2 \times BW) \tag{7.7}$$
7.3.1
Block Transfer During Handshake Completion
We define the handshake completion HC as the time between the coprocessor end of operation (Ncopro dn = 0) and the acknowledgement by the microcontroller (Ncopro st = 1). When the coprocessor finishes its operation, there may be blocks in the cache that were not updated in the main memory, yielding the problem of consistency. To avoid this, all the blocks in the cache must have their dirty bit checked, as described in Figure 7.10 of Section 7.2.3. If there is a block to be updated in the main memory, the coprocessor operation has
not finished yet, in spite of the function having been executed. Therefore, signal Ncopro dn is not propagated to the microcontroller, until all the blocks are updated into the main memory. Once this is done, the microcontroller receives the delayed version of Ncopro dn, through signal Ncopro dn del. The time related to the handshake completion can now be obtained by adding the block updating time (BU) to the original handshake completion time. The block updating time corresponds to the time interval between the assertion of Ncopro dn and the assertion of Ncopro dn del. It depends on the time spent in checking all the blocks in the cache directory (block test BT), the bus arbitration cycle (BAbw or BAint) for each block transfer required and the time spent in transferring dirty blocks into the main memory (block transfer BTR). Figure 7.18 shows when the coprocessor finishes operation (Ncopro dn = 0). Based on the algorithm presented in Figure 7.10, the cache directory is then searched for dirty blocks. Since our example performs a total of 10 coprocessor cache memory accesses, each to a different location, and considering the initial block size of 512 bytes, we can see that block 0 is the only block being written into and thus needs to be updated into the main memory. The block dirty bit is tested 2 clock cycles after Ncopro dn is asserted:
$$BT = 2 \tag{7.8}$$
Since dirty = 1, the main bus is requested 2 clock cycles later and the arbitration starts. If the block is not dirty, the test of the next block proceeds immediately afterwards, without a bus request. Figure 7.19 shows when block 0 has been updated into the main memory and signal dirty back is then negated. This operation is carried out by the bus interface controller and is described in Figure 7.13 of Section 7.2.4. The coprocessor memory controller is waiting for the negation of dirty back, in order to negate dirty, which is done one clock cycle later. At the same time, the bus is released by the bus interface controller (Nbgack = 1) and a new block is checked⁶. Figure 7.20 shows when the last block has been tested (sblock = 0xF). The coprocessor memory controller then propagates Ncopro dn through Ncopro dn del, 1 clock cycle later. Although all blocks must be checked in the cache directory, only dirty blocks require updating. Defining the number of dirty blocks as DB and considering, in Figure 7.18, Figure 7.19 and Figure 7.20, the block test time (BT), plus the delay to request the bus (2 clock cycles), plus the bus arbitration cycle (BA), plus the block transfer time (BTR), plus the delay to assert Ncopro dn del (1 clock cycle), we can now reformulate the equation for the block updating time (BU) as:
Figure 7.18. End of coprocessor operation and beginning of block update
$$BU = (BT \times (CS/BS)) + 2 \times DB + (BA \times DB) + (BTR \times DB) + 1 \tag{7.9}$$
$$BU = ((BT \times (CS/BS)) + 1) + (2 + BA + BTR) \times DB \tag{7.10}$$
For the microcontroller, the handshake completion starts now, since the bus interface controller receives Ncopro dn del, instead of Ncopro dn. Depending on the interface mechanism applied, we can have the following times, as described in Section 5.2.3:
handshake completion with busy-wait: 300ns to 540ns (5 to 9 clock cycles)
handshake completion with interrupt: 11160ns (186 clock cycles)
The handshake completion for the cache configuration is then given in (7.11).
$$HC_{cache} = HC_{original} + BU \tag{7.11}$$
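A minimal sketch of (7.10) and (7.11), in clock cycles, is given below. The function names are hypothetical; CS and BS are given in bytes, and the block transfer time is taken as BTR = 2 + 2 × BS from (7.3) and (7.4).

```python
def block_update_cycles(cs, bs, dirty_blocks, ba, bt=2):
    """BU of (7.10): test every block, then transfer each dirty one."""
    btr = 2 + 2 * bs                     # (7.4) with BW = 2 x BS
    num_blocks = cs // bs
    return (bt * num_blocks + 1) + (2 + ba + btr) * dirty_blocks


def handshake_completion_cycles(hc_original, bu):
    """HC for the cache configuration, as in (7.11)."""
    return hc_original + bu
```

With an 8K-byte cache and no dirty blocks, the function returns 33 clock cycles for BS = 512 and 65 clock cycles for BS = 256, which are the values quoted for BU(0) in Section 7.4.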
Figure 7.19. End of transfer of block 0
Once Ncopro st is negated, the coprocessor then negates Ncopro dn 4 clock cycles later. Following this, the coprocessor memory controller negates Ncopro dn del 1 clock cycle after Ncopro dn.
7.4
Performance Results
As described in the previous sections, the bus arbitration cycle, block size and cache size are directly associated with the time spent by the coprocessor in executing the function. Besides the coprocessor memory accesses, we are now concerned with the number of blocks required by the application. The number of memory accesses may be large, but it does not determine by itself the performance of the system. Since we use a data-only cache (see Section 7.1.2), the requirements of the application, in terms of array sizes, and the characteristics of the cache, in terms of number of words and block size, must now be considered.
Figure 7.20. Handshake completion after block updating
Once again, we use the example of Figure 5.1 of Chapter 5 as our benchmark, for different values of iterations and accesses, which determine the total number of memory accesses. Parameter operations is not considered here, since the memory access rate does not affect the behavior of the system in relation to the chosen interface mechanism. In this new implementation, we vary the block size to determine the relation between this parameter and the coprocessor memory accesses with the execution time of the application. The two interface mechanisms are examined, since they determine the bus arbitration cycle and the handshake process. Another set of simulation results is obtained and compared to the previous ones, in order to identify any performance improvement.
7.4.1
Varying the Number of Addressed Locations
In varying the parameter iterations we are really varying the number of different locations that can be addressed. Table 7.1 shows the execution times using
Table 7.1. Performing 10 operations and 1 memory access per iteration with BS = 512 bytes
| accesses | Tbw (ns) | Tint (ns) | Tint − Tbw | Tbw(i) − Tbw(i−1) | Tint(i) − Tint(i−1) | miss rate |
| 0 | 7920 | 18660 | 10740 | – | – | – |
| 1 | 137400 | 147720 | 10320 | 129480 | 129060 | 1.00 |
| 2 | 142500 | 152880 | 10380 | 5100 | 5160 | 0.50 |
| 3 | 147660 | 158160 | 10500 | 5160 | 5280 | 0.33 |
| 4 | 153000 | 163440 | 10440 | 5340 | 5280 | 0.25 |
| 5 | 158400 | 168720 | 10320 | 5400 | 5280 | 0.20 |
| 10 | 184800 | 195120 | 10320 | 5400 | 5280 | 0.10 |
| 20 | 237600 | 247920 | 10320 | 52800 | 52800 | 0.05 |
| 90 | 607200 | 617520 | 10320 | 52800 | 52800 | 0.01 |
| 100 | 660000 | 670320 | 10320 | 52800 | 52800 | 0.01 |
| 200 | 1188000 | 1198320 | 10320 | 528000 | 528000 | 0.00 |
| 300 | 1839540 | 1850100 | 10560 | 651540 | 651780 | 0.01 |
| 400 | 2367540 | 2378100 | 10560 | 528000 | 528000 | 0.00 |
| 500 | 2895540 | 2906100 | 10560 | 528000 | 528000 | 0.00 |
| 600 | 3547440 | 3557880 | 10440 | 651900 | 651780 | 0.00 |
| 700 | 4075440 | 4085880 | 10440 | 528000 | 528000 | 0.00 |
| 800 | 4727220 | 4737660 | 10440 | 651780 | 651780 | 0.00 |
| 900 | 5255220 | 5265660 | 10440 | 528000 | 528000 | 0.00 |
| 1000 | 5783220 | 5793660 | 10440 | 528000 | 528000 | 0.00 |
the busy-wait (Tbw) and the interrupt (Tint) mechanisms, with the coprocessor performing 10 internal operations (operations = 10) and 1 memory access per iteration (accesses = 1), which means that each location is addressed only once. The cache size (CS) is 4K words and the block size (BS) is 512 bytes, which provides a total of 16 blocks. The initial times Tbw(0) and Tint(0) correspond to the time necessary to run the coprocessor, without executing the function's main loop (see Figure 5.1, in Chapter 5), which means the time spent in parameter passing, some coprocessor internal operations (to deal with the loop controlled by iterations) and completion of the handshake, for each interface mechanism. Although there are neither blocks to be brought from the main memory to the cache nor from the cache to the main memory, the search of the cache directory for dirty blocks is still carried out. In the previous implementations, the initial times were 5820ns, for busy-wait, and 16680ns, for interrupt (see Section 5.3.1 and Section 6.4.1). The difference between the initial times is directly related to the test of the blocks, during block updating time. The block updating time, when iterations = 0, for both interface mechanisms, is defined as:
$$BU(0) = 33 \text{ clock cycles} \tag{7.12}$$
For the busy-wait mechanism, we can see that between the previous value (5820ns) and the present one (7920ns) there is a difference of 35 clock cycles. We conclude that this difference is not only related to the block updating time, but also to the difference between their handshake completions, which can vary from 5 to 9 clock cycles. Since for the interrupt mechanism the handshake completion is constant (186 clock cycles), the difference between the previous value (16680ns) and the present one (18660ns) corresponds exactly to the block updating time. Since our example addresses 16-bit data only, there is a direct relation between the addressed locations and the block size. For 1 ≤ iterations ≤ 256 there is only one block replacement and one block update to be performed, thus requiring 2 bus arbitration cycles. The equation for the execution time can be formulated as below⁷:
$$T(\mathit{iterations}) = T(0) + (BR + (BU - (BT \times (CS/BS) + 1))) \times 60 + 5280 \times \mathit{iterations} \tag{7.13}$$
Notice that we subtract the block test time and the delay between the end of the block transfer and the assertion of Ncopro dn del (1 clock cycle) from the block updating time, because they are already included in T(0). The coefficient of iterations corresponds to the average coprocessor cache memory access cycle, when using the busy-wait cycle, which includes the 10 internal operations, the single memory write performed in one iteration and internal control loop calculations. Considering that only one block is being used (DB = 1) and there is no write back during block replacement, we obtain the following equation for the execution time:
$$T(\mathit{iterations}) = T(0) + ((4 + BA + 2 \times BS) + (2 + BA + (2 + 2 \times BS))) \times 60 + 5280 \times \mathit{iterations} \tag{7.14}$$
$$T(\mathit{iterations}) = T(0) + (2 \times BA + 4 \times BS + 8) \times 60 + 5280 \times \mathit{iterations} \tag{7.15}$$
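The difference between the two interface mechanisms is:

\[ T_{int} - T_{bw} = T_{int}(0) - T_{bw}(0) + (2 \times \mathit{BA}_{int} - 2 \times \mathit{BA}_{bw}) \times 60 \tag{7.16} \]

\[ T_{int} - T_{bw} = 10740 + (\mathit{BA}_{int} - \mathit{BA}_{bw}) \times 120 \tag{7.17} \]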
which depends on the difference between their handshake completions (10740ns) and on the difference between their bus arbitration cycles, both of which depend on the interface mechanism used. Recall that the difference between the handshake completions could be between 177 and 181 clock cycles, while the difference between the bus arbitration cycles can be between 0 and 3 clock cycles (see Section 5.3). For 257 ≤ iterations ≤ 512 there are two block replacements to perform (without write back) and two dirty blocks to update in the main memory (recall that our example performs memory writes only), thus requiring 4 bus arbitration cycles. The execution time, for both interface mechanisms, is:

\[ T(\mathit{iterations}) = T(0) + (4 \times \mathit{BA} + 8 \times \mathit{BS} + 16) \times 60 + 5280 \times \mathit{iterations} \tag{7.18} \]
We see that the difference between them corresponds to the difference between the bus arbitration cycles and the handshake completions, as in the previous analysis. Since we have one memory access per iteration, the miss rate is not significant, in spite of increasing the number of addressed locations.
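As a cross-check, (7.15) and (7.18) can be folded into one small C routine. This is only a sketch under stated assumptions: the function and macro names are ours, the extension to n blocks (n times the per-block term) is extrapolated from the two ranges given above and ignores write back during replacement, BS is taken in bytes, and the value BA = 7 clock cycles used in the usage note is back-solved from the busy-wait rows of Tables 7.2 and 7.3 rather than quoted from the text.

```c
#include <assert.h>

#define CLOCK_NS 60L    /* clock period used in (7.13) to (7.18) */
#define LOOP_NS  5280L  /* average cost of one iteration (coefficient of iterations) */

/* Number of cache blocks touched by a sequential run of 16-bit writes
 * (2 bytes per addressed location). Hypothetical helper. */
long blocks_touched(long iterations, long bs_bytes)
{
    long locations_per_block = bs_bytes / 2;
    assert(iterations >= 0 && locations_per_block > 0);
    return (iterations + locations_per_block - 1) / locations_per_block;
}

/* Execution time in ns: one block gives (7.15), two blocks give (7.18).
 * ba_cycles is the bus arbitration time BA in clock cycles (assumed input). */
long exec_time_ns(long t0_ns, long ba_cycles, long bs_bytes, long iterations)
{
    long per_block_cycles = 2 * ba_cycles + 4 * bs_bytes + 8;
    return t0_ns
         + blocks_touched(iterations, bs_bytes) * per_block_cycles * CLOCK_NS
         + LOOP_NS * iterations;
}
```

With T(0) = 9720ns, BA = 7 and BS = 256, the call exec_time_ns(9720, 7, 256, 1) evaluates to 77760, the busy-wait entry for iterations = 1 in Table 7.2. For larger counts the routine only approximates the measured values, since 5280ns is an average.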
7.4.2 Varying the Block Size
Table 7.2 shows the execution times for each interface mechanism, varying the number of addressed locations (iterations), performing 10 coprocessor internal operations and 1 memory access per location (accesses = 1). Although we have the same simulation parameters as in the last section, we now use a block size BS of 256 bytes, which provides 32 blocks in the cache. Comparing the new values of Tbw(0) and Tint(0) with the previous ones, when BS = 512, we obtain:

\[ T_{bw}(0)_{BS=256} - T_{bw}(0)_{BS=512} = 1800\text{ns} \ (30 \text{ clock cycles}) \tag{7.19} \]

\[ T_{int}(0)_{BS=256} - T_{int}(0)_{BS=512} = 1920\text{ns} \ (32 \text{ clock cycles}) \tag{7.20} \]

This corresponds to the difference between the block updating times (BU) and the handshake completions for each interface mechanism. The new block updating time is the same for both interface mechanisms:

\[ \mathit{BU}(0)_{BS=256} = 65 \text{ clock cycles} \tag{7.21} \]

The handshake completion shows some difference only in the busy-wait, since it can vary between 5 and 9 clock cycles. Considering the interrupt mechanism, which always gives the same handshake completion time, we can say that the difference between the initial times corresponds exactly to the difference between their block updating times:

\[ \mathit{BU}(0)_{int,BS=256} - \mathit{BU}(0)_{int,BS=512} = 32 \text{ clock cycles} \tag{7.22} \]

Table 7.2. Performing 10 operations and 1 memory access per iteration, with BS = 256 bytes

iterations   Tbw (ns)  Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)  miss rate
0                9720      20580       10860                  –                    –          –
1               77760      88200       10440              68040                67620       1.00
2               82860      93360       10500               5100                 5160       0.50
3               88020      98640       10620               5160                 5280       0.33
4               93360     103920       10560               5340                 5280       0.25
5               98760     109200       10440               5400                 5280       0.20
9              119760     130320       10560               5340                 5280       0.11
10             125160     135600       10440               5400                 5280       0.10
20             177960     188400       10440              52800                52800       0.05
90             547560     558000       10440              52800                52800       0.01
100            600360     610800       10440              52800                52800       0.01
200           1191000    1201140       10140             590640               590340       0.01
300           1781100    1791480       10380             590100               590340       0.01
400           2371740    2381820       10080             590640               590340       0.01
500           2899740    2909820       10080             528000               528000       0.01
600           3489900    3500160       10260             590160               590340       0.01
700           4080540    4090500        9960             590640               590340       0.01
800           4671180    4680840        9660             590640               590340       0.01
900           5261520    5271180        9660             590340               590340       0.01
1000          5789520    5799180        9660             528000               528000       0.01

In spite of changing the block size, the equations formulated in the previous section are still valid, but now applied to different intervals of iterations, i.e. 1 ≤ iterations ≤ 128, corresponding to multiples of the block size. Observe also that the coefficient of iterations does not change, as the coprocessor cache memory cycle is independent of the block size. Table 7.3 shows another set of execution times for both interface mechanisms using a block size of 64 words (BS = 128 bytes). Following the same analysis, the difference between the initial execution times for BS = 512 and BS = 128 is:

\[ T_{bw}(0)_{BS=128} - T_{bw}(0)_{BS=512} = 5700\text{ns} \ (95 \text{ clock cycles}) \tag{7.23} \]

\[ T_{int}(0)_{BS=128} - T_{int}(0)_{BS=512} = 5760\text{ns} \ (96 \text{ clock cycles}) \tag{7.24} \]

The new block updating time is the same for both interface mechanisms:

\[ \mathit{BU}(0)_{BS=128} = 129 \text{ clock cycles} \tag{7.25} \]
Considering, again, the interrupt mechanism, which always presents the same handshake completion, we can say that the difference between the initial times corresponds exactly to the difference between their block updating times:

\[ \mathit{BU}(0)_{int,BS=128} - \mathit{BU}(0)_{int,BS=512} = 96 \text{ clock cycles} \tag{7.26} \]

Table 7.3. Performing 10 operations and 1 memory access per iteration, with BS = 128 bytes

iterations   Tbw (ns)  Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)  miss rate
0               13620      24420       10800                  –                    –          –
1               50940      61320       10380              37320                36900       1.00
2               56040      66480       10440               5100                 5160       0.50
3               61200      68760        7560               5160                 2280       0.33
4               66540      74040        7500               5340                 5280       0.25
5               71940      79320        7380               5400                 5280       0.20
9               92940     100440        7500               5340                 5280       0.11
10              98340     105720        7380               5400                 5280       0.10
20             151140     158520        7380              52800                52800       0.05
90             552060     562740       10680              52800                52800       0.02
100            604860     615540       10680              52800                52800       0.02
200           1196460    1206780       10320             591600               591240       0.02
300           1756020    1766400       10380             559560               559620       0.02
400           2347260    2357640       10380             591240               591240       0.02
500           2907180    2917260       10080             559920               559620       0.02
600           3498420    3508500       10080             591240               591240       0.02
700           4058340    4068120        9780             559920               559620       0.02
800           4649580    4659360        9780             591240               591240       0.02
900           5240820    5250600        9780             591240               591240       0.02
1000          5800740    5810220        9480             559920               559620       0.02

Once more, the equations for Tbw and Tint depend on the interval of values for iterations, which is based on multiples of the block size, i.e. 1 ≤ iterations ≤ 64.
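The three block updating times quoted in this section and the previous one (33, 65 and 129 clock cycles) follow the term BT × (CS/BS) + 1 that appears in (7.13). The C sketch below makes this explicit. The function name is ours, and the block test time BT = 2 clock cycles is an assumption back-solved from BU(0) = 33 with 16 blocks; it is not stated in this section.

```c
#include <assert.h>

#define CS_BYTES 8192L  /* CS = 4K words of 16 bits each */

/* Block updating time (in clock cycles) when no block is dirty:
 * every block in the directory is tested, plus 1 cycle of delay.
 * bt_cycles is the per-block test time BT (assumed value, see above). */
long bu0_cycles(long bs_bytes, long bt_cycles)
{
    assert(bs_bytes > 0 && CS_BYTES % bs_bytes == 0);
    long blocks = CS_BYTES / bs_bytes;  /* 16, 32 or 64 in this chapter */
    return bt_cycles * blocks + 1;
}
```

Called with BT = 2, the routine gives 33, 65 and 129 cycles for block sizes of 512, 256 and 128 bytes, which agrees with (7.12), (7.21) and (7.25).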
7.4.3 Varying the Number of Memory Accesses
As we know already, in varying the parameter accesses we are varying the number of memory accesses to the same addressed location in the memory. Combined with the parameter iterations, this gives the total number of memory accesses. In the case of the cache configuration, accesses controls the number of memory accesses to the same block. This means that, in spite of increasing the total number of memory accesses, we are not increasing either the block replacement or the block updating times. Table 7.4 shows the execution times for both interface mechanisms based on accesses, with the coprocessor performing 10 iterations and 10 internal operations per iteration. Since iterations corresponds to the number of different addressed locations, we conclude that only one block replacement and one block update are required, regardless of the number of memory accesses. Each memory access introduces an extra 7200ns in the execution times. Since we are performing 10 iterations, we can conclude that the time interval between two consecutive memory accesses is 720ns. The time interval between the last memory access in one iteration and the first one in the next iteration is 5280ns (see Table 7.1 of Section 7.4.1). Considering (7.15) (see note 8), we can express the execution times in terms of accesses and iterations as in (7.27):

\[ T(\mathit{iterations}, \mathit{accesses}) = T(0, \mathit{accesses}) + (2 \times \mathit{BA} + 4 \times \mathit{BS} + 8) \times 60 + (5280 + (\mathit{accesses} - 1) \times 720) \times \mathit{iterations} \tag{7.27} \]

Table 7.4. Performing 10 iterations and 10 internal operations, with BS = 512 bytes

accesses     Tbw (ns)  Tint (ns)  Tint − Tbw  Tbw(i) − Tbw(i−1)  Tint(i) − Tint(i−1)  miss rate
0               52320      63060       10740                  –                    –          –
1              184800     195240       10440             132480               132180       1.00
2              192000     202440       10440               7200                 7200       0.50
3              199200     209640       10440               7200                 7200       0.33
10             249600     260040       10440               7200                 7200       0.10
20             321600     332040       10440              72000                72000       0.05
30             393600     404040       10440              72000                72000       0.03
100            897600     908040       10440              72000                72000       0.01
200           1617600    1628040       10440             720000               720000       0.00
300           2337600    2348040       10440             720000               720000       0.00
400           3057600    3068040       10440             720000               720000       0.00
500           3777600    3788040       10440             720000               720000       0.00
600           4497600    4508040       10440             720000               720000       0.00
700           5217600    5228040       10440             720000               720000       0.00
800           5937600    5948040       10440             720000               720000       0.00
900           6657600    6668040       10440             720000               720000       0.00
1000          7377600    7388040       10440             720000               720000       0.00

Equation (7.27) is the same for any value of accesses, but it changes according to the value of iterations. This can be seen from (7.15) and (7.18), which correspond to different ranges of iterations. This is due to the fact that iterations affects the block replacement and block updating times, since it is related to the number of different addressed locations in the cache. The miss rate is not significant, in spite of increasing the number of memory accesses. When increasing accesses we are increasing the number of memory accesses to the same addressed location. Since the number of addressed locations is given by iterations, i.e. 10, there will be only one miss for each executed simulation.
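The iteration-dependent term of (7.27) can be checked against Table 7.4 with a few lines of C. The sketch below is illustrative only; the function and macro names are ours, and the two constants are the 5280ns and 720ns intervals quoted above.

```c
#include <assert.h>

#define FIRST_ACCESS_NS 5280L  /* one iteration with a single memory access */
#define EXTRA_ACCESS_NS  720L  /* each further access to the same location */

/* Iteration-dependent term of (7.27), in ns:
 * (5280 + (accesses - 1) * 720) * iterations. */
long loop_term_ns(long iterations, long accesses)
{
    assert(iterations >= 0 && accesses >= 1);
    return (FIRST_ACCESS_NS + (accesses - 1) * EXTRA_ACCESS_NS) * iterations;
}
```

For 10 iterations, going from accesses = 1 to accesses = 2 adds 7200ns, and going from accesses = 1 to accesses = 1000 adds 7192800ns. The latter matches the difference between the corresponding rows of Table 7.4 for both mechanisms (7377600ns − 184800ns for busy-wait).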
7.4.4 Speedup Achieved
The speedup achieved, S, is based on the comparison of the original single-port shared memory configuration with the new one including the coprocessor cache memory. The results obtained in the previous sections are used in order to analyze the performance improvement yielded by this new configuration, in terms of execution times. Considering (5.38), (5.40) and (7.27), which are in terms of accesses and iterations, the speedup achieved using the busy-wait (Sbw) and interrupt (Sint) mechanisms can be defined as the ratio of the corresponding execution times, as in (7.28) for the busy-wait and (7.29) for the interrupt mechanism.

\[ S_{bw}(\mathit{accesses}, \mathit{iterations})_{single/cache} = \frac{T_{bw}(\mathit{accesses}, \mathit{iterations})_{single}}{T_{bw}(\mathit{accesses}, \mathit{iterations})_{cache}} \tag{7.28} \]

\[ S_{int}(\mathit{accesses}, \mathit{iterations})_{single/cache} = \frac{T_{int}(\mathit{accesses}, \mathit{iterations})_{single}}{T_{int}(\mathit{accesses}, \mathit{iterations})_{cache}} \tag{7.29} \]
As the number of coprocessor memory accesses increases, the speedup for each interface mechanism approaches an upper bound, given in (7.30) for the busy-wait and (7.31) for the interrupt mechanism.

\[ \lim_{\substack{\mathit{iterations} \to \infty \\ \mathit{accesses} \to \infty}} S_{bw}(\mathit{accesses}, \mathit{iterations})_{single/cache} = 1.75 \tag{7.30} \]

\[ \lim_{\substack{\mathit{iterations} \to \infty \\ \mathit{accesses} \to \infty}} S_{int}(\mathit{accesses}, \mathit{iterations})_{single/cache} = 1.67 \tag{7.31} \]

The above analysis indicates that as the number of memory accesses increases the speedup is determined by the memory access cycle. The coprocessor memory access cycle corresponds to the time interval between two consecutive memory accesses. For the single-port shared memory, the coprocessor memory access cycle is 1260ns, using busy-wait, and 1200ns, using interrupt (see Section 5.3.3). These times include the bus arbitration (which depends on the interface), the main memory access time and the calculation of the next address. For the cache memory, the coprocessor memory access cycle is 720ns (see Section 7.4.3), independently of the interface mechanism. This time includes the cache memory access time and the calculation of the next address, which is the same for both implementations. Therefore, we conclude that the interface mechanism and the memory access time play a key role in the performance of the system, in terms of execution times, as in the previous cases.

Table 7.5. Performing 10 internal operations and 1 memory access per iteration, with BS = 512 bytes and random address locations

iterations    Tbw (ns)   Tint (ns)  Tint − Tbw  Miss rate
0                 7920       18660       10740          –
1               137100      147600       10500       1.00
10             1296600     1306380        9780       1.00
90            10600140    10605300        5160       0.91
100           11515500    11520420        4920       0.89
500           52330620    52319460      −11160       0.81
1000         103411140   103379940      −31200       0.80
5000         492415380   492249540     −165840       0.76

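The bounds in (7.30) and (7.31) coincide with the ratio between the single-port and cache memory access cycles quoted in Section 7.4.4 (1260ns or 1200ns against 720ns). A minimal C sketch of that arithmetic follows. The function name is ours, and reading the limit as a plain ratio of access cycles is our interpretation of the statement that the speedup is determined by the memory access cycle.

```c
#include <assert.h>

/* Ratio of two coprocessor memory access cycles, both in ns.
 * Hypothetical helper used to reproduce the limits (7.30) and (7.31). */
double speedup_bound(double single_cycle_ns, double cache_cycle_ns)
{
    assert(cache_cycle_ns > 0.0);
    return single_cycle_ns / cache_cycle_ns;
}
```

speedup_bound(1260.0, 720.0) gives 1.75 for busy-wait, and speedup_bound(1200.0, 720.0) gives approximately 1.67 for interrupt.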
7.4.5 Miss Rate with Random Address Locations
The example presented in Figure 5.1 (see Section 5.1) generates a continuous sequence of addresses, based on the number of iterations. Therefore, the probability of generating an address that accesses a block which is already in the cache is very high (greater than 99% in our examples). This is reflected in the low miss rates obtained in the previous sections. Figure 7.21 shows another example, in which the addresses are generated randomly. The first address is passed to the function through the parameter accesses. Random addresses are then generated during the subsequent iterations, using the operations inside the loop controlled by accesses.

    typedef short int array1 [10000];

    void example (array1 table, short int iterations,
                  short int operations, short int accesses)
    {
        short int count, index;
        short int temp = 0;

        for (count = 0; count < iterations; count++) {
            for (index = 0; index < operations; index++)
                temp += 1;
            if (accesses > 0) {
                accesses *= 1845;
                accesses &= 16383;
                table [accesses] = temp;
            }
        }
    }

    array1 table;
    short int iterations = 10;
    short int operations = 1;
    short int accesses = 1;

    main ()
    {
        example (table, iterations, operations, accesses);
        exit (0);
    }

Figure 7.21. C program example, with random address locations

Table 7.5 shows the execution times for both interface mechanisms based on iterations, with the coprocessor performing 10 internal operations and 1 memory access per iteration. This is similar to the experiment that generated the results given in Table 7.1. The difference is that the addresses are now generated randomly, instead of sequentially. Observing the miss rates obtained, we can see that as the number of memory accesses increases the miss rate decreases. This means that the probability of finding the generated address in a cache block increases with the number of memory accesses performed by the coprocessor.

Another observation is that the behavior of the execution times changes according to the interface mechanism employed. We can see that the busy-wait mechanism offers shorter execution times than the interrupt mechanism when the number of memory accesses is small. After a certain number of memory accesses, the interrupt mechanism becomes the better option. In spite of reducing the miss rate, the number of blocks to be transferred from the main memory to the cache increases compared to the previous example. Hence, there are more block replacements to be performed, which leads to a larger number of bus accesses. The time spent in the interrupt acknowledge cycle does not depend on the number of memory accesses, since it is related to the end of execution of the function, and occurs only once. On the other hand, bus arbitration takes place during every block replacement, because the shared bus must be accessed so that the coprocessor can transfer a block from the shared memory into the cache. Notice that only one bus arbitration is required per block replacement, because the bus is released by the coprocessor only after the block transfer is complete. Since the bus arbitration overhead with the busy-wait mechanism can be equal to or greater than that obtained with the interrupt mechanism, the execution times using the busy-wait mechanism increase faster than when using the interrupt mechanism, as the number of memory accesses increases.

Table 7.6 presents the results obtained when using a block size of 64 words. We can see that the execution times are much smaller (around 29% of those obtained when using a block size of 256 words). This is due to the fact that the time spent during block transfer is shorter, since the blocks are smaller, in spite of having more blocks to transfer. The miss rate also decreases, which can be explained by the fact that we can have more blocks in the cache than before. Therefore, the probability of generating an address that accesses a block which is already in the cache increases, as the number of memory accesses increases.

Table 7.6. Performing 10 operations and 1 memory access per iteration, with BS = 128 bytes and random address locations

iterations    Tbw (ns)   Tint (ns)  Tint − Tbw  Miss rate
0                 7920       18660       10740          –
1                50008       61200       11192       1.00
10              382320      391800        9480       1.00
100            3150480     3153540        3060       0.83
500           15061620    15048900      −12720       0.80
1000          29423880    29391300      −32580       0.77
5000         145425900   145253700     −172200       0.77

We can conclude that with our workload the only advantage in reducing the block size is the reduction of the execution times, which is quite significant, since the miss rate does not decrease in a similar proportion (6% at most). We should keep in mind that we are now generating addresses randomly and that the principle of locality does not apply in this situation. Hence, we now have the opposite case compared to the previous example, where the addresses were generated sequentially, leading to “maximum” locality between consecutive memory addresses.
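To make the effect of random addressing concrete, the C sketch below replays the address sequence of Figure 7.21 against a direct-mapped cache directory and counts the misses. It is an illustration under assumptions, not the VHDL model: count_misses and its parameters are hypothetical names, the table is assumed to start at a block-aligned address, addresses are treated as 16-bit word indices, and only the directory (one tag per block) is modelled, with no timing.

```c
#include <assert.h>

#define CACHE_WORDS 4096u   /* CS = 4K words */
#define ADDR_MASK   16383u  /* mask applied to accesses in Figure 7.21 */
#define MAX_LINES   256u    /* upper limit on cache blocks in this sketch */

/* Count misses for `iterations` writes generated as in Figure 7.21,
 * starting from `seed`, with blocks of `words_per_block` 16-bit words. */
long count_misses(long iterations, unsigned seed, unsigned words_per_block)
{
    long tag[MAX_LINES];
    unsigned lines, addr, i;
    long n, misses = 0;

    assert(words_per_block > 0);
    lines = CACHE_WORDS / words_per_block;
    assert(lines >= 1 && lines <= MAX_LINES);

    for (i = 0; i < lines; i++)
        tag[i] = -1;                        /* every block starts empty */

    addr = seed & ADDR_MASK;
    for (n = 0; n < iterations; n++) {
        addr = (addr * 1845u) & ADDR_MASK;  /* same update as the figure */
        unsigned block = addr / words_per_block;
        unsigned line = block % lines;      /* direct mapping */
        if (tag[line] != (long)block) {     /* block not present: miss */
            misses++;
            tag[line] = (long)block;
        }
    }
    return misses;
}
```

The first access always misses, so count_misses(1, 1, 256) returns 1. Since the 4K-word cache covers one quarter of the 16K-word range selected by the mask, a miss rate in the region of 0.75 would be expected for uniformly spread addresses, which is of the same order as the values reported in Tables 7.5 and 7.6.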
7.5 Summary
This chapter presented a new configuration for the original co-design system, introducing a coprocessor cache memory. The aim was to improve the performance of the whole system by avoiding bus contention during the coprocessor memory accesses. The speedups achieved show that the execution times can be further reduced, compared to those obtained with the dual-port configuration described in the previous chapter. We expressed the execution times in terms of the bus arbitration time, the block size and the total number of coprocessor cache memory accesses, for the busy-wait and the interrupt mechanisms. We noticed that this relationship depends on the number of coprocessor cache memory accesses to different locations and the block size, since the number of block replacements and block updates changes with these two parameters. Consequently, the miss rate also depends on the block size, the number of coprocessor cache memory accesses to different locations and the “locality” of the memory references. In the next chapter, we present the overall conclusions of this work and some ideas for further research and new trends in co-design.
Notes

1. Here, we consider the main address bus bandwidth.
2. The coprocessor can be identified by the symbols C2:U2 just before the signal name.
3. The cache memory can be identified by the symbols C2:U4:U6 just before the signal name.
4. The relation between Nmemfin and the end of coprocessor memory access was already discussed in Section 5.3.3, and can be seen in Figure 5.9.
5. Signals prefixed by C2:U4:U6 belong to the cache memory and signals prefixed by U6:U6 belong to the main memory.
6. All the signal values correspond to the position of the cursor in the figures.
7. We define only one equation for both interface mechanisms because they have the same coefficient for iterations, differing only in the bus arbitration time BA.
8. This equation applies for BS = 512 and 1 ≤ iterations ≤ 256.
Chapter 8 ADVANCED TOPICS AND FURTHER RESEARCH
The objective of the co-design system developed is to speed up a software application. Therefore, from a system specification, written in C or C++, a performance critical region may be selected for hardware implementation. The initial specification is then partitioned into a hardware sub-system, implementing the critical region, and a software sub-system, executing the application. The target architecture consists of a microcontroller, global memory, controllers and coprocessor. The aim of this research was to analyze the interface between these two sub-systems and to determine how the performance of the whole system is affected by the type of interface used. In this chapter, we draw some conclusions based on our experimental work and present recommendations for further research as well as advanced topics in co-design.
8.1 Conclusions and Achievements
Initially, the interface between the hardware and software sub-systems was based on the busy-wait mechanism. This offered an interface with a simple protocol and a low cost. On the other hand, the software sub-system was completely dedicated to the application, without the chance to exploit any possible parallelism. Instead of keeping the microcontroller dedicated to the implementation of the busy-wait loop, we included the interrupt mechanism. Considering the two interfaces available, we ran two benchmark programs (PLUM and EGCHECK) in the co-design system and obtained their execution times, which could be compared to their software-only implementation. For PLUM, the execution time of the software-only implementation was 6836ms and that of the mixed implementation was 2234ms, when using the busy-wait, and 2319ms, when using the interrupt. This gave speedups of 3.1 and 3.0, respectively. For EGCHECK, the execution time of the software-only implementation was 10917ms and that of the mixed implementation was 6193ms, when using the busy-wait, and 5518ms, when using the interrupt. This gave speedups of 1.8 and 2.0, respectively. When comparing the speedups obtained for the two benchmarks, we see that the first program yields a higher speedup when using the busy-wait mechanism, whilst the second one yields a higher speedup when using the interrupt mechanism. PLUM contains 800,000 coprocessor memory accesses, while EGCHECK has 6.4 million coprocessor memory accesses. The memory accesses are directly associated with the problem of bus contention in a shared bus architecture. We conclude that the busy-wait mechanism is better than the interrupt mechanism when the coprocessor performs more internal operations than memory accesses. On the other hand, the interrupt mechanism is better than the busy-wait mechanism when the coprocessor performs more memory accesses than internal operations. In order to determine the most suitable mechanism for an application, one that will provide the best system performance, we concentrated our analysis on the number of coprocessor memory accesses and the memory access rate. A VHDL model of the co-design system was developed in order to provide the means for analyzing the system in a variety of ways. Instead of the benchmark programs used with the physical implementation, a case study was proposed, which allowed us to vary the number of coprocessor memory accesses, as well as the rate of such memory accesses. A memory access includes bus arbitration. We showed that the bus arbitration time with the busy-wait mechanism may be twice that corresponding to the interrupt mechanism.
This explains the results obtained for the benchmarks. On the other hand, we showed that the handshake completion time for the interrupt mechanism is very much greater than that associated with the busy-wait mechanism. Nevertheless, the handshake completion occurs only once during the execution of an application, i.e., when the coprocessor finishes executing the hardware function. Based on the simulation results, using the VHDL model of the co-design system, we successfully related the coprocessor memory accesses and the interface mechanism. The difference between the execution times using busy-wait and interrupt is expressed in terms of the number of memory accesses, the difference between their bus arbitration times and the difference between their handshake completion times. Furthermore, we concluded that the difference between the execution times does not depend on the coprocessor memory access rate.

As the performance of the shared bus system depends, amongst other parameters, on bus contention, we investigated the impact of using a dual-port memory in place of the single-port shared memory. This allowed us to eliminate the occurrence of bus contention. The VHDL model was altered to reflect the new architecture. As before, for both of the interface mechanisms, we expressed the execution time in terms of the number of coprocessor memory accesses. The difference between the execution times using the busy-wait mechanism and those using the interrupt mechanism is constant and is equal to the difference between their handshake completion times. This shows that, in this new configuration, the performance of the system depends on the interface mechanism only. Moreover, we established an upper bound of 1.50 and 1.43 for the speedups obtained with the busy-wait and interrupt mechanisms, respectively, in relation to the single-port shared memory system. The single-port shared memory of the original implementation consisted of two 1M × 8 bits DRAMs. Although we replaced this global memory by a dual-port memory for simulation purposes, we did not find dual-port dynamic memories available on the market, only static ones, and in a very restricted number of different internal organizations. Therefore, for implementation purposes, we could consider the use of the dual-port memory for a portion of the address space and constrain the coprocessor memory accesses to this space. The cost of a 4K × 8 bits dual-port memory is approximately four times the cost of an equivalent single-port memory and one third the cost of a 1M × 8 bits DRAM. Furthermore, we can say that there was no significant increase in the logic due to the addition of more functionality to the coprocessor in order to access the dual-port memory. As discussed in Section 6.2.3 and Section 6.2.4, the coprocessor memory access controller in the bus interface logic was transferred to another component (coprocessor memory controller). This can be seen as just a functional reorganization, not incurring extra costs.
Therefore, the extra cost due to the use of a dual-port memory seems well compensated by the expected performance improvement.

The other alternative architecture consisted of employing a cache memory for the coprocessor. Again, the VHDL model was updated to include this new feature. Based on our simulation results, we related the execution times to the number of coprocessor memory accesses, bus arbitration time and cache block size, for both of the interface mechanisms. The difference between the execution times for the busy-wait and interrupt mechanisms depended on the difference between their handshake completion times and their bus arbitration times, which, in turn, depended on the interface mechanism used. Finally, we established an upper bound of 1.75 and 1.67 for the speedups obtained with the busy-wait and interrupt mechanisms, respectively, in relation to the single-port shared memory system. For the coprocessor cache memory implementation, we have the additional cost of implementing the cache memory itself, plus all the necessary logic for its operation. A cache memory of 4K × 8 bits implemented with SRAM would cost around one tenth the cost of one 1M × 8 bits DRAM, which is used in the target architecture. The cache directory can be implemented using SRAM embedded in the FPGA, such as the Intel FPGA iFX780 presented in Section 2.4.2. In this case, the size of the cache directory depends on the size of the block, as discussed in Section 7.2.3. The amount of logic associated with the cache operation is based on the description of the coprocessor memory controller and the bus interface controller, presented in Section 7.2.3 and Section 7.2.4. We could estimate the use of one Intel FPGA iFX780 for the two controllers and the cache directory. Taking into account that we already use 3 Intel FPGAs iFX780 in the original implementation of the co-design system (see Section 3.1.5), the expected performance compensates for the extra cost due to the use of another FPGA and 4K × 8 bits SRAM.

Finally, the content of this book provides us with important parameters to determine the most suitable interface mechanism between the hardware and software components in a co-design environment. The results obtained, both from the physical implementation and from the simulation model, show that the characteristics of the application, in terms of memory accesses, and those of the target architecture, such as bus arbitration, handshake completion, parameter and control passing, should be considered when determining the interface mechanism. Moreover, alternative architectures should be explored in the future, in the same manner as was proposed here with the use of the dual-port memory and the coprocessor cache memory, in search of a better performance/cost trade-off.
8.2 Advanced Topics and Further Research
In the following sections, we explore some interesting directions for further work. This can be regarded as a natural extension of the research work presented in this book as well as some new trends in hardware/software co-design.
8.2.1 Complete VHDL Model
The VHDL model developed for the co-design system is not complete, but was considered to be suitable for the analysis undertaken. The microcontroller model does not include, for example, instruction fetch, decode and execution. As a result, we cannot simulate the complete software sub-system properly. Partitioning should take into account the communication between the two subsystems before hardware synthesis. Indeed, it is interesting to have a complete model, in order to check the performance of a system for different partitioning alternatives.
8.2.2 Cost Evaluation
Throughout this work, we considered only the performance of the co-design system, in terms of execution time. However, cost is another design constraint that should be taken into account. For a given configuration, the corresponding cost can be obtained through synthesis of those modules designed for FPGA implementation, such as the dual-port memory controller and cache memory controller. Nevertheless, it would be interesting to have a cost estimation after partitioning and before synthesis. In this case, a cost estimation function could be developed and integrated with the co-design system.
8.2.3 New Configurations
Here, we investigated the use of a dual-port shared memory and coprocessor cache memory. An interesting hybrid configuration consists of a dual-port cache memory for the coprocessor. This would allow pre-fetching of data arrays based on the principle of locality, thus further reducing the cache miss rate. In the cache memory configuration, we considered only direct mapping for the block replacement policy. It would be interesting to investigate other policies, such as fully associative and set associative mapping, according to the characteristics of the application.
8.2.4 Interface Synthesis
The interface between the hardware and software sub-systems includes the necessary hardware and software for the communication between them, such as parameter and control passing. In our case, this interface was defined “a priori” and merged, in turn, with the hardware description of the critical region for synthesis. Therefore, it does not take into account the individual characteristics of each application.
8.2.5 Architecture Synthesis
A set of architecture templates could be modelled. The choice of the best architecture to employ would be based on the characteristics of the application and on the performance/cost requirements.
8.2.6 Framework for Co-Design
Our co-design system makes use of different tools for software profiling, hardware/software partitioning, hardware synthesis, software compilation and system integration. The development of a framework for co-design, considering all the resources already available, would significantly help the designer in obtaining a physical implementation from a technology-independent initial specification.
Another characteristic associated with a framework for co-design is the possibility of co-simulation, in the same manner as PTOLEMY does (see Section 2.7). System-level hardware/software co-simulation is a way to give designers feedback on their design choices, especially those related to partitioning.
8.2.7 General Formalization
The work developed in this thesis is based on the co-design system implemented at UMIST, described in Chapter 3. In order to enable a systematic analysis of the behaviour of the system for different architecture configurations, we developed a VHDL model of the co-design system, described in Chapter 4. During the simulations, three memory configurations were considered: single-port shared memory (see Chapter 5); dual-port shared memory (see Chapter 6); single-port shared memory, with a coprocessor cache memory (see Chapter 7). All the equations developed were based on the VHDL model implemented for each of these memory configurations. Therefore, they are constrained to the characteristics of the co-design system, such as the microcontroller clock frequency, coprocessor clock frequency, memory access time, interrupt acknowledge cycle, busy-wait cycle, bus arbitration cycle and handshake completion time.

Considering the single-port shared memory configuration, the execution times obtained experimentally using the busy-wait mechanism ($T_{bw}$) and the interrupt mechanism ($T_{int}$) are expressed as:

$$T_{bw}(iterations) = 5820 + (5880 \times iterations) \tag{8.1}$$

$$T_{int}(iterations) = 16680 + (5760 \times iterations) \tag{8.2}$$
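The two experimental models can be compared numerically. The following is a minimal sketch in C, assuming that the values in (8.1) and (8.2) are in nanoseconds; the function names and the break-even helper are hypothetical and are not part of the co-design system.

```c
#include <assert.h>
#include <stdint.h>

/* Experimental model (8.1): busy-wait mechanism, assumed in nanoseconds. */
uint64_t t_bw_ns(uint64_t iterations)
{
    return 5820u + 5880u * iterations;
}

/* Experimental model (8.2): interrupt mechanism, assumed in nanoseconds. */
uint64_t t_int_ns(uint64_t iterations)
{
    return 16680u + 5760u * iterations;
}

/* Hypothetical helper: smallest iteration count for which the interrupt
 * model is strictly faster than the busy-wait model. */
uint64_t break_even_iterations(void)
{
    uint64_t n = 0;
    while (t_int_ns(n) >= t_bw_ns(n)) {
        ++n;
    }
    return n;
}

/* Self-check against the constants printed in (8.1) and (8.2). */
void check_models(void)
{
    assert(t_bw_ns(0) == 5820u);
    assert(t_int_ns(0) == 16680u);
    assert(break_even_iterations() == 91u);
}
```

Under these two models, the larger fixed cost of the interrupt mechanism is recovered by its smaller per-iteration cost once iterations reaches 91, for operations = 10 and accesses = 1.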
When iterations = 0, we obtain the time necessary to run the coprocessor without executing the function’s main loop (see Figure 5.1 in Section 5.1). This is the time spent in parameter passing, control passing to start the coprocessor, some coprocessor internal operations (to deal with the calculations of the loop controlled by iterations) and the handshake completion. The first two times are the same for both interface mechanisms. When iterations > 0, we add the time to execute the main loop, which takes into consideration the other parameters. The equations above are for operations = 10 and accesses = 1. Therefore, the value of the coefficient of iterations is associated with the execution of 10 coprocessor internal operations, 1 coprocessor memory access per iteration and the coprocessor internal control calculations for the loops. Since a coprocessor memory access takes place, the time associated with the bus arbitration is also included in the coefficient, which therefore depends on the interface mechanism employed.

A general formalization for the execution time could be produced, in order to allow the characteristics of different system architectures to be evaluated. We can express the execution time of the hardware sub-system as:

$$T_{hw} = T_{po} + T_{co} + T_{ex} + T_{ci} + T_{pi} \tag{8.3}$$

where:
$T_{po}$: time taken to transfer the required parameters from the software sub-system to the hardware sub-system;
$T_{co}$: time taken to transfer control from the software sub-system to the hardware sub-system;
$T_{ex}$: time taken to execute the function in the hardware sub-system;
$T_{ci}$: time taken to transfer control from the hardware sub-system to the software sub-system;
$T_{pi}$: time taken to transfer the required parameters from the hardware sub-system to the software sub-system.

Times $T_{po}$ and $T_{pi}$ are related to the characteristics of the parameters of the function (type, number) and to the interface between the software and the hardware sub-systems. In the latter case, we consider the way in which the parameters are passed, e.g. through memory-mapped registers, and the protocol adopted for its implementation, e.g. asynchronous bus transfer. In our example (see Figure 5.1 in Section 5.1), no parameters are returned from the function. Therefore, we will analyze $T_{po}$ only. There are four parameters to be passed to the function: table (32 bits), iterations (16 bits), operations (16 bits) and accesses (16 bits). Besides these, there is an 8-bit coded parameter (inst) passed to the coprocessor to select the part of the VHDL state machine related to the function (see Appendix D). Considering the times provided in Section 5.2.1, we have:

$$T_{po} = 1200\,\text{ns} + (3 \times 600\,\text{ns}) + 540\,\text{ns} \tag{8.4}$$

$$T_{po} = 3540\,\text{ns} \tag{8.5}$$
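The arithmetic in (8.4) can be sketched in C as a sum over the widths of the parameters. The mapping of widths to transfer times (1200ns for 32 bits, 600ns for 16 bits, 540ns for 8 bits) is our reading of (8.4) and is an assumption of this sketch, as are all the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Transfer time for one parameter of the given width. The mapping of
 * widths to times is read off (8.4) and is an assumption of this sketch. */
uint32_t transfer_time_ns(unsigned width_bits)
{
    switch (width_bits) {
    case 32: return 1200u;
    case 16: return 600u;
    case 8:  return 540u;
    default:
        assert(0 && "unsupported parameter width");
        return 0u;
    }
}

/* T_po: sum of the transfer times of all parameters sent to the hardware. */
uint32_t t_po_ns(const unsigned *widths_bits, size_t count)
{
    uint32_t total = 0u;
    for (size_t i = 0; i < count; ++i) {
        total += transfer_time_ns(widths_bits[i]);
    }
    return total;
}

/* The example of Figure 5.1: table, iterations, operations, accesses, inst. */
uint32_t t_po_example_ns(void)
{
    static const unsigned widths[] = { 32u, 16u, 16u, 16u, 8u };
    return t_po_ns(widths, sizeof widths / sizeof widths[0]);
}
```

For the five parameters of the example, the sketch gives the 3540ns of (8.5).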
Times $T_{co}$ and $T_{ci}$ are related to the interface protocol between the two sub-systems. The microcontroller starts the coprocessor by writing into the coprocessor control register and asserting signal copro_st (see Section 3.2.4). Hence, the time taken to send control to the coprocessor ($T_{co}$) is related to the time taken by the microcontroller to write into the memory-mapped control register (see Section 5.2):

$$T_{co} = 240\,\text{ns} \tag{8.6}$$

The coprocessor indicates the end of the function’s execution by asserting signal copro_dn. Consequently, the microcontroller negates signal copro_st, writing once again into the control register. Meanwhile, the coprocessor waits for the negation of copro_st in order to negate signal copro_dn (see Appendix D), as part of the handshake protocol. We consider the time between the assertion of signal copro_dn and the negation of signal copro_st as the handshake completion, which is associated with the time to transfer control from the coprocessor to the microcontroller ($T_{ci}$). In the busy-wait mechanism, signal copro_dn is latched into the control register and read by the microcontroller (see Section 3.2.4) during a busy-wait cycle, providing the following value for $T_{ci}$ (see Section 5.2.3):

$$T_{ci} = 300\,\text{ns to } 540\,\text{ns} \tag{8.7}$$

In the interrupt mechanism, the assertion of signal copro_dn generates an interrupt request (see Section 3.2.4). Hence, the handshake completion is related to the interrupt acknowledge cycle, providing the following value for $T_{ci}$ (see Section 5.2.4):

$$T_{ci} = 11160\,\text{ns} \tag{8.8}$$
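The order of events in the copro_st/copro_dn handshake under the busy-wait mechanism can be illustrated with a small software model. This is only a sketch in C: the structure, the cycle counter and the assumption of one poll per coprocessor clock cycle are hypothetical simplifications and do not reproduce the timing of the real interface.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool copro_st;     /* start: driven by the microcontroller */
    bool copro_dn;     /* done: driven by the coprocessor */
    int  cycles_left;  /* hypothetical: remaining work, in clock cycles */
} handshake_t;

/* One coprocessor clock cycle of the handshake logic. */
void copro_clock(handshake_t *h)
{
    if (h->copro_st && !h->copro_dn) {
        if (h->cycles_left > 0) {
            --h->cycles_left;
        }
        if (h->cycles_left == 0) {
            h->copro_dn = true;      /* function finished */
        }
    } else if (!h->copro_st && h->copro_dn) {
        h->copro_dn = false;         /* handshake completion */
    }
}

/* Busy-wait call/return; returns the number of polls of copro_dn. */
int call_busy_wait(handshake_t *h, int work_cycles)
{
    int polls = 0;
    assert(work_cycles >= 0);
    h->cycles_left = work_cycles;
    h->copro_st = true;              /* start() */
    do {
        copro_clock(h);
        ++polls;
    } while (!h->copro_dn);          /* while (!is_finished()); */
    h->copro_st = false;             /* ack_stop() */
    copro_clock(h);                  /* coprocessor negates copro_dn */
    return polls;
}

/* Runs one call and checks that both signals return to the idle level. */
int demo_polls(int work_cycles)
{
    handshake_t h = { false, false, 0 };
    int polls = call_busy_wait(&h, work_cycles);
    assert(!h.copro_st && !h.copro_dn);
    return polls;
}
```

The comments start(), is_finished() and ack_stop() refer to the calls used in the adapted benchmark programs of Appendix A; their actual implementation is not shown in this chapter.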
The analysis of $T_{ex}$ is carried out using the function’s VHDL code implemented by the coprocessor (see Appendix D). $T_{ex}$ can be subdivided into two other components:
$T_{iops}$: time taken by the hardware sub-system to execute the internal operations;
$T_{ios}$: time taken by the hardware sub-system to execute the input/output operations and/or memory accesses.

Time $T_{iops}$ is estimated from the function’s VHDL code. Analyzing the state machine, the function’s execution starts in state 2. Four states (from 2 to 5) are required until the test of the parameter iterations. If iterations = 0, the main loop is not executed (see Figure 5.1 in Section 5.1) and the function is completed (state 10 and state 0). Otherwise, the main loop is entered (state 8). It takes two states (state 8 and state 9) until the test of the parameter operations. If operations ≠ 0, then three states (state 11 to state 13) are executed a number of times equal to the value of operations. Once this loop finishes, state 14 is entered, preparing for the loop controlled by parameter accesses. It takes two states (state 14 and state 15) until the test of the parameter accesses. If accesses = 0, three states (state 17 to state 19) are required until the test of the parameter iterations. If iterations = 0, the function is completed (state 10 and state 0). On the other hand, if accesses ≠ 0, then one state (state 16) is used to obtain the address of the array (table) and the loop controlled by accesses starts (from state 20 to state 24), which implements a memory access. When accesses reaches 0, the state machine enters state 17, to test the parameter iterations that controls the main loop. Once the main loop finishes, state 10 and state 0 are entered, with the state machine being kept in state 0 waiting for the handshake completion. Considering that the coprocessor’s clock cycle is 120ns (see Section 3.2.5), we can compute the value of $T_{iops}$ as follows, where each parenthesised list of states stands for the number of states it contains:

$$T_{iops} = \big((2,3,4,5) + (10,0) + \big((8,9) + (11,12,13) \times operations + (14,15) + (16) + (20,21,22,23,24) \times accesses + (17,18,19)\big) \times iterations\big) \times 120\,\text{ns} \tag{8.9}$$

$$T_{iops} = 720 + \big((960 + (360 \times operations) + (600 \times accesses)) \times iterations\big) \tag{8.10}$$
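The state counting behind (8.9) and (8.10) can be checked with a short C sketch. It follows the equation as printed, so it does not treat operations = 0 or accesses = 0 as special cases; the function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { COPRO_CLOCK_NS = 120 };  /* coprocessor clock cycle (Section 3.2.5) */

/* T_iops following (8.9): number of states visited times the clock cycle. */
uint64_t t_iops_ns(uint64_t iterations, uint64_t operations, uint64_t accesses)
{
    const uint64_t fixed_states = 4u + 2u;  /* states 2-5, then 10 and 0 */
    const uint64_t states_per_iteration =
        2u                  /* states 8 and 9 */
        + 3u * operations   /* states 11 to 13, once per operation */
        + 2u                /* states 14 and 15 */
        + 1u                /* state 16 */
        + 5u * accesses     /* states 20 to 24, once per access */
        + 3u;               /* states 17 to 19 */
    return (fixed_states + states_per_iteration * iterations) * COPRO_CLOCK_NS;
}

/* Self-check against the constants of (8.10). */
void check_t_iops(void)
{
    assert(t_iops_ns(0, 10, 1) == 720u);
    /* 960 + (360 x 10) + (600 x 1) = 5160 per iteration */
    assert(t_iops_ns(1, 10, 1) - t_iops_ns(0, 10, 1) == 5160u);
}
```

For operations = 10 and accesses = 1, each iteration contributes 5160ns, which is the coefficient that appears in the estimates of the hardware execution time.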
Observe that the first term (720ns) is independent of the parameter iterations. As a matter of fact, it is the time required to execute the function without executing its main loop. Hence, it is included in the times for $T_{bw}$ and $T_{int}$ when iterations = 0 (see, for example, (5.4) and (5.5) in Section 5.3.1).

Time $T_{ios}$ depends on the interface between the hardware sub-system and the main system, as well as on the memory access time. Our example implements only memory accesses. A memory access is requested in state 20 of the function’s state machine, by the assertion of signal mem_req. The state machine then enters state 21 and remains in this state until the memory access is completed, when signal mem_fin is asserted by the bus interface controller (see Figure 5.12 in Section 5.3.3). Considering a single-port shared memory, there will always be bus arbitration at every memory access. In the busy-wait mechanism, the time required to complete a memory access varies between 600ns and 840ns, with an average of 720ns. In the interrupt mechanism, the time required to complete a memory access varies between 600ns and 660ns, with an average of 630ns (see Note 1).

With the above parameters defined, we can now estimate the execution time of our example, according to the characteristics of our co-design system, considering 10 internal operations and one memory access per iteration (see Section 5.3.1). Since the parameters differ according to the interface mechanism employed, we will estimate the execution time using the interrupt mechanism ($T_{hw_{int}}$) as follows, with $T_{pi} = 0$ because no parameters are returned:

$$T_{hw_{int}} = T_{po} + T_{co} + T_{ex_{int}} + T_{ci_{int}} \tag{8.11}$$

$$T_{hw_{int}} = 3540 + 240 + \big(720 + (5160 \times iterations) + (630 \times iterations)\big) + 11160 \tag{8.12}$$

$$T_{hw_{int}} = 15660 + (5790 \times iterations) \tag{8.13}$$

The value 5160 follows from (8.10) with operations = 10 and accesses = 1, i.e. 960 + (360 × 10) + (600 × 1). For the coefficient of iterations, we are also considering the average of the time required to complete the memory access (630ns), explained above. In order to obtain a closer approximation to the equivalent equation developed during the simulation process (see (5.4) and (5.5) in Section 5.3.1), we should consider the extra time introduced by the timer at the beginning and at the end of the simulation. The timer is started by the microcontroller 240ns before the first parameter is sent and it is stopped by the microcontroller 360ns after the negation of signal copro_st, which means 360ns after the handshake completion. Therefore, we would obtain the following result:

$$T_{hw_{int}} = 15660 + (5790 \times iterations) + 240 + 360 \tag{8.14}$$

$$T_{hw_{int}} = 16260 + (5790 \times iterations) \tag{8.15}$$
Time $T_{ex}$ thus depends on the operating frequency of the hardware sub-system and on its interface with the main system. When using the single-port shared memory configuration, bus arbitration is present at every coprocessor memory access, with bus contention when using the busy-wait mechanism and without it when using the interrupt mechanism. Considering the dual-port memory configuration, there is neither bus arbitration nor bus contention. These characteristics affect only the coprocessor memory accesses, for which we obtained 300ns (see Section 6.3 and Section 6.4.1). Therefore, the execution time for our example, using the interrupt mechanism, can be estimated as:

$$T_{hw_{int}} = 3540 + 240 + \big(720 + (5160 \times iterations) + (300 \times iterations)\big) + 11160 \tag{8.16}$$

$$T_{hw_{int}} = 15660 + (5460 \times iterations) \tag{8.17}$$

In order to obtain a closer approximation to the equations developed during the simulation process (see (6.1) and (6.2) in Section 6.4.1), we consider the extra time introduced by the timer (600ns):

$$T_{hw_{int}} = 16260 + (5460 \times iterations) \tag{8.18}$$
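The estimates (8.15) and (8.18) differ only in the coprocessor memory access time, so both can be obtained from one function. The following C sketch assumes the interrupt mechanism, operations = 10 and accesses = 1, and uses hypothetical names throughout; the deviation helper compares the single-port estimate with the experimental model (8.2).

```c
#include <assert.h>
#include <stdint.h>

enum {
    T_PO_NS            = 3540,       /* (8.5) */
    T_CO_NS            = 240,        /* (8.6) */
    T_CI_INT_NS        = 11160,      /* (8.8) */
    T_IOPS_FIXED_NS    = 720,        /* (8.10) with iterations = 0 */
    T_IOPS_ITER_NS     = 5160,       /* (8.10), operations = 10, accesses = 1 */
    TIMER_NS           = 240 + 360,  /* timer start and stop overhead */
    MEM_SINGLE_PORT_NS = 630,        /* average, interrupt mechanism */
    MEM_DUAL_PORT_NS   = 300
};

/* (8.3) for the interrupt mechanism with T_pi = 0, plus the timer overhead. */
uint64_t t_hw_int_ns(uint64_t iterations, uint64_t mem_access_ns)
{
    uint64_t t_ex = T_IOPS_FIXED_NS
                  + (T_IOPS_ITER_NS + mem_access_ns) * iterations;
    return T_PO_NS + T_CO_NS + t_ex + T_CI_INT_NS + TIMER_NS;
}

/* Single-port estimate minus the experimental model (8.2), in ns. */
int64_t deviation_from_experiment_ns(uint64_t iterations)
{
    uint64_t measured = 16680u + 5760u * iterations;
    return (int64_t)t_hw_int_ns(iterations, MEM_SINGLE_PORT_NS)
         - (int64_t)measured;
}

/* Self-check against (8.15) and (8.18). */
void check_estimates(void)
{
    assert(t_hw_int_ns(0, MEM_SINGLE_PORT_NS) == 16260u);
    assert(t_hw_int_ns(1, MEM_SINGLE_PORT_NS) == 16260u + 5790u);
    assert(t_hw_int_ns(1, MEM_DUAL_PORT_NS) == 16260u + 5460u);
}
```

With these constants, the single-port estimate is 420ns below the experimental model at iterations = 0 and grows 30ns per iteration faster than it, which gives an idea of the size of the error discussed next.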
The results obtained above for the single-port shared memory and for the dual-port shared memory validate the proposed equation for the estimation of the execution time of the hardware sub-system ($T_{hw}$), based on the characteristics of the co-design system and of the hardware function. We should, however, recall that an error is expected, since some features cannot be considered in our approach, such as the synchronization between the two sub-systems. We discussed this problem when analyzing the coprocessor memory access in Figure 5.12 of Section 5.3.3 and in Figure 6.15 of Section 6.4.1.
Notes
1 These timing characteristics are discussed in Section 5.3.3. We are not taking into account here the time interval between the assertion of mem_fin and the next coprocessor memory request (540ns).
Appendix A Benchmark Programs
This appendix presents the two benchmark programs executed in the physical co-design system and used for performance evaluation. We present the C source program (.c) and the adapted C source program (.mod), generated by the partitioner tool synth [100]. The first program is called PLUM and has the function inner synthesised in FPGA, performing 800,000 memory accesses. The external timer is used to provide the execution time for the function proc, which calls function inner. Before calling proc, we start the timer and, just after the end of the execution of proc, we read the timer, keeping its value in ticks.

    // plum.c
    void inner(TYPE *A, TYPE *B, TYPE m[10]) {
        TYPE a, b, c;
        a = *A;
        b = *B;
        for (c = 1; c <= 40; ++c) {
            a = a + b + c;
            b = a >> 1;
            a = b % 10;
            m[a] = a;
            b = m[a] - b - c;
            a = b == c;
            b = b | c;
            a = !b;
            b = b + c;
            a = b > c;
        }
        *A = a;
        *B = b;
    }

    void proc (void) {
        STOR_CL TYPE a, b, c;
        int d, major;
        static TYPE m[10] = {0};
        major = 10000;
        a = b = 34;
        for (d = 1; d <= major; ++d)
            inner (&a, &b, m);
    }

    main () {
        unsigned long long ticks = 0;
        start_cntr();
        proc();
        ticks = stop_cntr ();
        {
            char message[128];
            sprintf(message, "Execution time using the external timer = %lu ns\n",
                    (unsigned long) ticks);
            __write (1, message, strlen(message));
        }
        exit (0);
    }

In the adapted form of PLUM (plum.mod), we identify, in the variable declaration part of the function, the pointer declarations and their corresponding addresses, used for parameter passing. In the body of the function, the transfer of parameters is performed, followed by the “hardware” call/return, implementing the busy-wait mechanism.

    // plum.mod
    void inner(TYPE *A, TYPE *B, TYPE m[10]) {
        unsigned char *const inst = (unsigned char *) 0x680003;
        char** const param0 = (char**) (0x680004 + 4 - sizeof(char*));
        char** const param1 = (char**) (0x680008 + 4 - sizeof(char*));
        char** const param2 = (char**) (0x68000c + 4 - sizeof(char*));
        *inst = 0;
        *param0 = A;
        *param1 = B;
        *param2 = m;
        start();
        while(!is_finished());
        ack_stop();
    }

    void proc (void){
        STOR_CL TYPE a, b, c;
        int d, major;
        static TYPE m[10] = {0};
        major = 10000;
        a = b = 34;
        for (d = 1; d <= major; ++d)
            inner (&a, &b, m);
    }

    main () {
        unsigned long long ticks = 0;
        start_cntr();
        proc();
        ticks = stop_cntr ();
        {
            char message[128];
            sprintf (message, "Execution time, using the external timer = %lu ms\n",
                     (unsigned long) ticks );
            __write (1, message, strlen(message));
        }
        exit (0);
    }

The second program is called EGCHECK, shown below. Its function decode is implemented in FPGA, performing 6.3 million memory accesses.

    // egcheck.c
    void decode(short int highword[7000], short int lowword[7000],
                short int _f1[7000], short int _f2[7000], short int _f3[7000]) {
        int count;
        int temp1, temp2;
        for (count = 0; count != 7000; count++) {
            temp1 = highword[count];
            temp2 = lowword[count];
            _f1[count] = ((temp1 & 0xc000) >> 14);
            _f2[count] = ((temp1 & 0x3f80) >> 7);
            highword[count] = (temp1 & 0x007f);
            _f1[count] = ((temp2 & 0xe000) >> 13);
            _f2[count] = ((temp2 & 0x1800) >> 11);
            _f3[count] = ((temp2 & 0x0700) >> 8);
            lowword[count] = (temp2 & 0x00ff);
        }
    }

    short int array[20000];
    short int high[7000], low[7000];
    short int wd1[7000], wd2[7000], wd3[7000];

    main() {
        int count;
        unsigned long long ticks;
        const int iterations = 100;
        for (index=0; index
    // egcheck.mod
    void decode(short int highword[7000], short int lowword[7000],
                short int _f1[7000], short int _f2[7000], short int _f3[7000]) {
        unsigned char *const inst=(unsigned char *)0x680003;
        short int** const param0 = (short int**) (0x680004 + 4 - sizeof(short int*));
        short int** const param1 = (short int**) (0x680008 + 4 - sizeof(short int*));
        short int** const param2 = (short int**) (0x68000c + 4 - sizeof(short int*));
        short int** const param3 = (short int**) (0x680010 + 4 - sizeof(short int*));
        short int** const param4 = (short int**) (0x680014 + 4 - sizeof(short int*));
        *inst = 0;
        *param0 = highword;
        *param1 = lowword;
        *param2 = _f1;
        *param3 = _f2;
        *param4 = _f3;
        start();
        while(!is_finished());
        ack_stop();
    }

    short int array[20000];
    short int high[7000], low[7000];
    short int wd1[7000], wd2[7000], wd3[7000];

    main() {
        int count;
        unsigned long long ticks;
        const int iterations = 100;
        for (index=0; index
Appendix B Top-Level VHDL Model of the Co-design System
This appendix presents the complete VHDL description of the top-level model for the co-design system (main system), discussed in Section 4.2 of Chapter 4. As explained then, external connections are not required, which means that there are no entity ports.

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE work.system_types.ALL;

    ENTITY main_system IS
      -- no external interfaces.
    END main_system;

    ARCHITECTURE structure of main_system IS
      -- system main component declarations.
      COMPONENT dram16 -- {U6,U7}
        PORT(
          rNw:      IN    bitz;     -- read write
          Ncas0:    IN    bitz;     -- column address strobe0
          Ncas1:    IN    bitz;     -- column address strobe1
          Nras:     IN    bitz;     -- row address strobe
          addr_mux: IN    addrmux;  -- row col multiplexed bus
          data:     INOUT databus   -- 16 bit data bus
        );
      END COMPONENT;

      COMPONENT addr_mux --{U8}
        PORT(
          clk:       IN  bitz;      -- main system clock
          Nreset:    IN  oc_bus01;  -- reset signal
          uNl:       IN  bitz;      -- upper / lower byte
          same_page: OUT bitz;
          addr_inc:  IN  bitz;      -- inc address for page mode
          cpu_space: IN  bitz;
          addr:      IN  bus_vector(22 DOWNTO 1);
          addr_mux:  OUT addrmux    -- row col multiplexed bus
        );
      END COMPONENT;
      COMPONENT main_ctrl --{U5}
        PORT(
          clk:       IN     bitz;      -- main system clock
          Nreset:    IN     oc_bus01;  -- reset signal
          cpu_space: OUT    bitz;
          refresh:   IN     bitz;
          addr_inc:  OUT    bitz;      -- inc address for page mode
          Nbr:       OUT    bitz;      -- bus request
          Nbgack:    OUT    bitz;      -- bus grant acknowledge
          Nbg:       IN     bitz;      -- bus grant
          Nrmc:      IN     bitz;      -- read modify cycle
          Ncsboot:   IN     bitz;
          same_page: IN     bitz;
          Nras:      BUFFER bitz;      -- combined ras0 ras1 signal
          Ncas0:     OUT    bitz;
          Ncas1:     OUT    bitz;
          uNl:       OUT    bitz;
          Nwr:       OUT    bitz;
          Noe:       OUT    bitz;
          rNw:       INOUT  bus01;
          Nds:       INOUT  bus01;
          Nas:       INOUT  bus01;
          Nhalt:     INOUT  bitz;
          Nberr:     INOUT  bitz;
          fc:        IN     fcbus;
          siz:       IN     bus02;
          Ndsack:    INOUT  oc_bus02;
          c1_ctrl:   INOUT  ctrlbus;   -- control lines
          c2_ctrl:   INOUT  ctrlbus;   -- control lines
          addr:      IN     bus_vector(23 DOWNTO 21);
          addr0:     IN     bitz       -- address line 0
        );
      END COMPONENT;

      COMPONENT microcontroller --{U1}
        PORT(
          clk:     OUT   bitz;      -- main system clock
          Nreset:  OUT   oc_bus01;  -- reset
          rNw:     OUT   bus01;     -- read write
          Nds:     OUT   bus01;     -- data strobe
          Nas:     OUT   bus01;     -- address strobe
          Nrmc:    OUT   bitz;      -- read modify cycle
          Navec:   IN    bitz;      -- autovector for ints
          Nhalt:   INOUT bitz;      -- halt
          Nberr:   IN    bitz;      -- bus error
          Ncsboot: OUT   bitz;      -- cs for boot ROM
          Nbr:     IN    bitz;      -- bus request
          Nbg:     OUT   bitz;      -- bus grant
          Nbgack:  IN    bitz;      -- bus grant acknowledge
          refresh: OUT   bitz;      -- DRAM refresh from TPU
          addr:    OUT   addrbus;   -- 24 bit address bus
          data:    INOUT databus;   -- 16 bit data bus
          fc:      OUT   fcbus;     -- function code bus
          siz:     OUT   bus02;     -- data size signal lines
          Ndsack:  IN    oc_bus02;  -- data size acknowledge
          Nirq:    IN    oc_bus02   -- interrupt request
        );
      END COMPONENT;
      COMPONENT copro_board -- {C2}
        PORT(
          clk:    IN    bitz;      -- main system clock
          Nreset: IN    oc_bus01;
          Nas:    INOUT bitz;
          Nds:    INOUT bitz;
          rNw:    INOUT bitz;
          Nhalt:  INOUT bitz;
          Nberr:  INOUT bitz;
          Navec:  OUT   bitz;
          fc:     OUT   fcbus;     -- function codes
          addr:   INOUT addrbus;   -- 24 bit address bus
          data:   INOUT databus;   -- 16 bit data bus
          siz:    INOUT bus02;     -- data size lines
          Ndsack: INOUT oc_bus02;  -- data size acknowledge lines
          Nirq:   OUT   oc_bus02;  -- interrupt request
          ctrl:   INOUT ctrlbus    -- control codes for slot(c2_ctrl)
        );
      END COMPONENT;

      COMPONENT timer -- {C1}
        PORT(
          clk:    IN    bitz;      -- main system clock
          Nreset: IN    oc_bus01;
          Nas:    INOUT bitz;
          Nds:    INOUT bitz;
          rNw:    INOUT bitz;
          addr:   INOUT addrbus;   -- 24 bit address bus
          data:   INOUT databus;   -- 16 bit data bus
          siz:    INOUT bus02;     -- data size lines
          Ndsack: INOUT oc_bus02;  -- data size acknowledge lines
          ctrl:   INOUT ctrlbus    -- control codes for slot(c1_ctrl)
        );
      END COMPONENT;

      -- bit signal declarations
      SIGNAL clk:       bitz;      -- main system clock
      SIGNAL Nreset:    oc_bus01;  -- reset line set to high at present
      SIGNAL Nrmc:      bitz;      -- read modify cycle.
      SIGNAL Ncsboot:   bitz;
      SIGNAL Nbr:       bitz;      -- bus request.
      SIGNAL Nbg:       bitz;      -- bus grant.
      SIGNAL Nbgack:    bitz;      -- bus grant acknowledge.
      SIGNAL refresh:   bitz;      -- refresh signal originating from tp0
      SIGNAL Nwr:       bitz;      -- read write signal
      SIGNAL Ncas0:     bitz;      -- col address strobe 0
      SIGNAL Ncas1:     bitz;      -- col address strobe 1
      SIGNAL Nras:      bitz;      -- row address strobe 0 & 1
      SIGNAL uNl:       bitz;      -- upper lower byte
      SIGNAL same_page: bitz;
      SIGNAL addr_inc:  bitz;      -- keep row address but inc col address
      SIGNAL cpu_space: bitz;
      SIGNAL Noe:       bitz;      -- output enable.

      -- bus signal declarations (multidriver lines).
      SIGNAL rNw:     bus01 BUS;
      SIGNAL Nds:     bus01 BUS;
      SIGNAL Nas:     bus01 BUS;
      SIGNAL Navec:   oc_bus01 BUS;
      SIGNAL Nhalt:   oc_bus01 BUS;
      SIGNAL Nberr:   oc_bus01 BUS;
      SIGNAL addr:    addrbus BUS;   -- Tri-stated address bus
      SIGNAL data:    databus BUS;   -- Tri-stated data bus
      SIGNAL fc:      fcbus BUS;     -- Tri-stated function code bus
      SIGNAL siz:     bus02 BUS;     -- Tri-stated data size request signal
      SIGNAL Ndsack:  oc_bus02 BUS;  -- data size acknowledge
      SIGNAL Nirq:    oc_bus02 BUS;
      SIGNAL c1_ctrl: ctrlbus BUS;   -- control codes for slot one
      SIGNAL c2_ctrl: ctrlbus BUS;   -- control codes for slot two

      -- bit vector signal declarations (single driver lines).
      SIGNAL addr_mux: addrmux;      -- multiplexed row col address for DRAM

    -- main system architecture body
    BEGIN
      C1: timer -- {using slot one}
        PORT MAP(
          clk    => clk,     -- main system clock
          Nreset => Nreset,
          Nas    => Nas,
          Nds    => Nds,
          rNw    => rNw,
          addr   => addr,
          data   => data,
          siz    => siz,
          Ndsack => Ndsack,
          ctrl   => c1_ctrl  -- slot one control codes
        );

      C2: copro_board -- {using slot two}
        PORT MAP(
          clk    => clk,     -- main system clock
          Nreset => Nreset,
          Nas    => Nas,
          Nds    => Nds,
          rNw    => rNw,
          Nhalt  => Nhalt,
          Nberr  => Nberr,
          Navec  => Navec,
          fc     => fc,
          addr   => addr,
          data   => data,
          siz    => siz,
          Ndsack => Ndsack,
          Nirq   => Nirq,
          ctrl   => c2_ctrl  -- slot two control codes
        );

      U1: microcontroller
        PORT MAP(
          clk     => clk,    -- main system clock
          Nreset  => Nreset,
          rNw     => rNw,
          Nds     => Nds,
          Nas     => Nas,
          Nrmc    => Nrmc,
          Navec   => Navec,
          Nhalt   => Nhalt,
          Nberr   => Nberr,
          Ncsboot => Ncsboot,
          Nbr     => Nbr,
          Nbg     => Nbg,
          Nbgack  => Nbgack,
          refresh => refresh,
          addr    => addr,
          data    => data,
          fc      => fc,
          siz     => siz,
          Nirq    => Nirq,
          Ndsack  => Ndsack
        );

      U5: main_ctrl
        PORT MAP(
          clk       => clk,     -- main system clock
          Nreset    => Nreset,  -- reset signal (not used)
          cpu_space => cpu_space,
          refresh   => refresh,
          addr_inc  => addr_inc,
          Nbr       => Nbr,
          Nbgack    => Nbgack,
          Nbg       => Nbg,
          Nrmc      => Nrmc,
          Ncsboot   => Ncsboot,
          same_page => same_page,
          Nras      => Nras,
          Ncas0     => Ncas0,
          Ncas1     => Ncas1,
          uNl       => uNl,
          Nwr       => Nwr,
          Noe       => Noe,
          rNw       => rNw,
          Nds       => Nds,
          Nas       => Nas,
          Nhalt     => Nhalt,
          Nberr     => Nberr,
          fc        => fc,
          siz       => siz,
          Ndsack    => Ndsack,
          c1_ctrl   => c1_ctrl,
          c2_ctrl   => c2_ctrl,
          addr      => addr(23 DOWNTO 21),
          addr0     => addr(0)
        );
      U8: addr_mux
        PORT MAP(
          clk       => clk,
          Nreset    => Nreset,
          uNl       => uNl,
          same_page => same_page,
          addr_inc  => addr_inc,
          cpu_space => cpu_space,
          addr      => addr(22 DOWNTO 1),
          addr_mux  => addr_mux
        );

      U6: dram16
        PORT MAP(
          rNw      => rNw,
          Nras     => Nras,              -- combined ras1 ras2 sig
          data     => data(15 DOWNTO 0),
          Ncas0    => Ncas0,             -- column address strobe 0
          Ncas1    => Ncas1,
          addr_mux => addr_mux           -- multiplexed bus
        );
    END structure;
    -- Main system components configuration details.
    CONFIGURATION main_system_config OF main_system IS
      FOR structure
        FOR U6: dram16
          USE CONFIGURATION work.dram16_config;
        END FOR;
        FOR U8: addr_mux
          USE ENTITY work.addr_mux(rtl);
        END FOR;
        FOR U5: main_ctrl
          USE ENTITY work.main_ctrl(behaviour);
        END FOR;
        FOR U1: microcontroller
          USE CONFIGURATION work.microcontroller_config;
        END FOR;
        FOR C2: copro_board --{slot two}
          USE CONFIGURATION work.copro_board_config;
        END FOR;
        FOR C1: timer --{slot one}
          USE ENTITY work.timer(behaviour);
        END FOR;
      END FOR;
    END main_system_config;
Appendix C Translating PALASM™ into VHDL
The translation of a circuit specification written in PALASM® 2 [48] into a VHDL description is quite straightforward. Combinatorial circuits are implemented in PALASM using Boolean equations. The output from the equation is a pin name or a node name. For example, we can have the following statement in PALASM:

    a = b * c * /d        (C.1)

which translates into VHDL as:

    a <= b AND c AND (NOT d);        (C.2)
In PALASM, we specify the controlling logic for the output enable of a signal using the extension “.TRST”. For example, in the following statement:

    /a = /b * /c
    /a.TRST = d

when signal d is asserted, i.e. d = 0, the output signal a goes to high impedance. This allows for the implementation of bi-directional signals. In VHDL, this control translates into a process, with sensitivity list consisting of the output enable signal and those signals that drive signal a. Then, we have the following equivalent statement in VHDL:

    sig_a: PROCESS (b, c, d)
    BEGIN
      IF (d = '1') THEN
        a <= NOT (NOT b AND NOT c);
      ELSE
        a <= 'Z';
      END IF;
    END PROCESS sig_a;

Registered circuits are implemented in PALASM using the same Boolean equation syntax as combinatorial circuits, but with a “:=” operator in place of the “=”. When using the dedicated clock pin on a device with a single clock, it is not necessary to specify the clock input. However, it is possible to choose the clock as part of the design methodology. The clock for the circuit uses the output name with a “.CLKF” extension on the left-hand side of the “=” symbol and the clock input name on the right-hand side. In the following example, signal a is registered, using the system clock clk:

    a.CLKF = clk    ; rising edge of the clock
    a := b

which can be implemented in VHDL as:

    a <= b WHEN (clk'EVENT AND clk = '1') ELSE a;

Asynchronous preset is provided in PALASM to allow registers to be independently preset, i.e., set to 1. The preset equation is implemented by using the output name with a “.SETF” extension. Another example of registered output is given by signal a below:

    a.CLKF = /clk    ; falling edge of the clock
    a.SETF = /Nreset
    a.TRST = e
    /a := b + c + d

which, in VHDL, can be implemented as:

    sig_a: PROCESS (clk, Nreset, e)
    BEGIN
      IF (Nreset = '0' AND e = '1') THEN
        a <= '1';
      ELSIF e = '0' THEN
        a <= 'Z';
      ELSIF (clk'EVENT AND clk = '0') THEN
        a <= NOT (b OR c OR d);
      END IF;
    END PROCESS sig_a;

In the same manner, asynchronous clear is also provided in PALASM to allow the registers to be independently reset, i.e., set to 0. The reset equation is implemented by using the output name with a “.RSTF” extension.

Asynchronous clocking of registers is implemented in PALASM by assigning the register clock to an equation or pin other than a dedicated clock input. The clock signal uses the output name with an “.ACLK” extension. The following statement exemplifies the use of asynchronous clocking for signal a:

    a.ACLK = b
    a.RSTF = /Nreset
    a := c

translating into VHDL as:

    sig_a: PROCESS (b, Nreset)
    BEGIN
      IF Nreset = '0' THEN
        a <= '0';
      ELSIF (b'EVENT AND b = '1') THEN
        a <= c;
      END IF;
    END PROCESS sig_a;

Truth tables provide a way of describing a design or parts of a design. In PALASM, each truth table section begins with the “T_TAB” keyword. They are position dependent, i.e., each input has a corresponding column for each row of the table. The first line of the table lists the input and output signals for the truth table inside parentheses. Subsequent rows list the values of each output for each combination of inputs. The following exemplifies the use of truth tables:

    T_TAB (a b c d >> e f g h)
           0101 : 0011
           0110 : 0011
           0100 : 1100
           1000 : 0011

In VHDL, we specify the above truth table as the following process:

    t_table: PROCESS (a, b, c, d)
    BEGIN
      IF (a = '0' AND b = '1' AND c = '0' AND d = '1') OR
         (a = '0' AND b = '1' AND c = '1' AND d = '0') OR
         (a = '1' AND b = '0' AND c = '0' AND d = '0') THEN
        e <= '0'; f <= '0'; g <= '1'; h <= '1';
      ELSIF (a = '0' AND b = '1' AND c = '0' AND d = '0') THEN
        e <= '1'; f <= '1'; g <= '0'; h <= '0';
      ELSE
        e <= '0'; f <= '0'; g <= '0'; h <= '0';
      END IF;
    END PROCESS t_table;

State machines provide a way of describing sequential logic. Each state machine in PALASM begins with the “STATE” keyword followed by the machine type, which can be either “MEALY_MACHINE” or “MOORE_MACHINE”. Outputs on Moore machines depend on the current state only. Outputs on Mealy machines depend on both the current state and the current inputs. This is followed by subsections that identify the global defaults, state transitions, output values and transition conditions. The conditions subsection is denoted by the “CONDITIONS”
keyword. Synchronous state machines transition on the rising edge of a dedicated clock pin. Asynchronous state machines transition when the specified condition used as the clock is true. The following exemplifies the specification of a state machine in PALASM:

    STATE MOORE_MACHINE
    ; state assignments
    s0 = /q1 * /q0
    s1 = /q1 * q0
    s2 = q1 * /q0
    s3 = q1 * q0
    ; state transitions
    s0 := cond1 -> s1
    s1 := cond2 -> s2
    s2 := cond3 -> s3
    s3 := cond4 -> s4
    ; transition outputs
    s0.OUTF = a
    s1.OUTF = a
    s2.OUTF = /a
    s3.OUTF = a
    CONDITIONS
    cond1 = b * /c * /d
    cond2 = /b
    cond3 = d
    cond4 = c
    EQUATIONS
    q1.RSTF = /Nreset
    q1.CLKF = clk
    q0.RSTF = /Nreset
    q0.CLKF = clk

In VHDL, the corresponding state machine is specified as a process, with its sensitivity list consisting of the clock (clk) and the reset signals (Nreset):

    state_machine: PROCESS (clk, Nreset)
    BEGIN
      IF Nreset = '0' THEN
        state <= s0;
      ELSIF (clk'EVENT and clk = '1') THEN
        CASE state IS
          WHEN s0 =>
            IF (b = '1' AND c = '0' AND d = '0') THEN
              state <= s1;
            END IF;
          WHEN s1 =>
            IF b = '0' THEN
              state <= s2;
            END IF;
          WHEN s2 =>
            IF d = '1' THEN
              state <= s3;
            END IF;
          WHEN s3 =>
            IF c = '1' THEN
              state <= s0;
            END IF;
        END CASE;
      END IF;
    END PROCESS state_machine;
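The behaviour of this state machine can also be exercised in software. The following C sketch mirrors the transitions of the VHDL process, so state s3 returns to s0 on cond4 (the PALASM listing names s4 as the target). The reset is modelled as taking priority at each step rather than as a truly asynchronous input, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { S0, S1, S2, S3 } state_t;

/* Next state on a rising clock edge; an active-low reset forces s0. */
state_t next_state(state_t s, bool n_reset, bool b, bool c, bool d)
{
    if (!n_reset) {
        return S0;
    }
    switch (s) {
    case S0: return (b && !c && !d) ? S1 : S0;  /* cond1 = b * /c * /d */
    case S1: return !b ? S2 : S1;               /* cond2 = /b */
    case S2: return d ? S3 : S2;                /* cond3 = d */
    case S3: return c ? S0 : S3;                /* cond4 = c */
    }
    return S0;  /* not reached for valid states */
}

/* Moore output: a depends on the current state only (/a in s2). */
bool output_a(state_t s)
{
    return s != S2;
}

/* Walks s0 -> s1 -> s2 -> s3 -> s0 and counts the states, on arrival,
 * in which output a is asserted. */
int demo_cycle(void)
{
    state_t s = S0;
    int asserted = 0;
    s = next_state(s, true, true, false, false);   /* cond1 holds */
    assert(s == S1);
    asserted += output_a(s);
    s = next_state(s, true, false, false, false);  /* cond2 holds */
    assert(s == S2);
    asserted += output_a(s);
    s = next_state(s, true, false, false, true);   /* cond3 holds */
    assert(s == S3);
    asserted += output_a(s);
    s = next_state(s, true, false, true, false);   /* cond4 holds */
    assert(s == S0);
    asserted += output_a(s);
    return asserted;
}
```

Keeping the next-state function separate from the output function reflects the Moore structure: the output is computed from the state alone, after the transition has been taken.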
Appendix D VHDL Version of the Case Study
This appendix presents the VHDL versions of the C function example, presented in Figure 5.1 of Chapter 5. The first version consists of the behavioral description of the function, used for synthesis, and is generated by the partitioner tool synth. The second version corresponds to the register-transfer description of the function, used for simulation, and is generated by the partitioner tool synth2. A complete description of these tools can be found in [100].
Version 1 of Example

    -- autogenerated with translation tool synth
    -- translation tool internal identification
    library asyl, votanlib, csynth;
    use asyl.arith.all;
    use votanlib.timing_constraints.all;
    use csynth.synthStuff.all;

    entity xciu is
      port(clk:       IN  bit;
           reset:     IN  bit;
           copro_st:  IN  bit;
           copro_dn:  OUT bit;
           fpga_rd:   IN  bit;
           fpga_wr:   IN  bit;
           fpga_cs0:  IN  bit;
           reg_fin:   OUT bit;
           fpga_rw:   OUT bit;
           mem_req:   OUT bit;
           mem_fin:   IN  bit;
           fpga_size: OUT bit_vector (1 downto 0);
           ain:       IN  bit_vector (4 downto 2);
           aout:      OUT bit_vector (23 downto 0);
           din:       IN  bit_vector (31 downto 0);
           dOUT:      OUT bit_vector (31 downto 0);
           den:       IN  bit_vector (3 downto 0));
    end xciu;

    architecture procedure_level of xciu is
    begin
      main: process
        procedure memRd (bytes: IN bit_vector (1 downto 0);
                         a_OUT: IN bit_vector;
                         d_in: OUT bit_vector (31 downto 0)) is
        begin
          fpga_size <= bytes;
          fpga_rw <= '1';
          mem_req <= '1';
          aout <= a_OUT(23 downto 0);
          wait until clk = '1';
          mem_req <= '0'; -- single pulse mem_req
          loop
            wait until clk = '1';
            exit when mem_fin = '1';
          end loop;
          d_in := din (31 downto 0);
          wait_source (0);
        end memRd;

        procedure memRd8 (bytes: IN bit_vector (1 downto 0);
                          a_OUT: IN bit_vector;
                          d_in: OUT bit_vector (7 downto 0)) is
        begin
          fpga_size <= bytes;
          fpga_rw <= '1';
          mem_req <= '1';
          aout <= a_OUT(23 downto 0);
          wait until clk = '1';
          mem_req <= '0'; -- single pulse mem_req
          loop
            wait until clk = '1';
            exit when mem_fin = '1';
          end loop;
          if a_OUT(0) = '1' then
            d_in := din (7 downto 0);
          else
            d_in := din (15 downto 8);
          end if;
          wait_source (0);
        end memRd8;

        procedure memRd16 (bytes: IN bit_vector (1 downto 0);
                           a_OUT: IN bit_vector;
                           d_in: OUT bit_vector (15 downto 0)) is
        begin
          fpga_size <= bytes;
          fpga_rw <= '1';
          mem_req <= '1';
          aout <= a_OUT(23 downto 0);
          wait until clk = '1';
          mem_req <= '0'; -- single pulse mem_req
          loop
            wait until clk = '1';
            exit when mem_fin = '1';
          end loop;
          d_in := din (15 downto 0);
          wait_source (0);
        end memRd16;

        procedure memWr (bytes: IN bit_vector (1 downto 0);
                         a_OUT: IN bit_vector;
                         d_OUT: IN bit_vector) is
        begin
          fpga_size <= bytes;
          fpga_rw <= '0';
          aout <= a_OUT (23 downto 0);
          if bytes = bit_vector'(b"01") then
            dOUT (15 downto 8) <= d_OUT (7 downto 0);
            dOUT (7 downto 0) <= d_OUT (7 downto 0);
          else
            dOUT (d_OUT'high downto d_OUT'low) <= d_OUT;
          end if;
          mem_req <= '1';
          wait until clk = '1';
          mem_req <= '0'; -- single pulse mem_req
          loop
            wait until clk = '1';
            exit when mem_fin = '1';
          end loop;
          wait_source (0);
        end memWr;

        procedure unsignedMulD (op1, op2: IN bit_vector;
                                lower_res: INOUT bit_vector;
                                higher_res: INOUT bit_vector) is
          -- higher_res is P, lower_res/op1 is A, op2 is B
        begin
          assert((op1'length=op2'length) and
                 (op1'length=higher_res'length) and
                 (op1'length=lower_res'length))
            report "Mismatch in operand sizes"
            severity error;
          lower_res := op1;
          higher_res := to_bitvector(0, higher_res'length);
for i IN op1’range loop if lower_res(lower_res’low) = ’1’ then higher_res:= higher_res + op2; else higher_res:= higher_res + to_bitvector(0,higher_res’length); end if; lower_res:= shiftR(lower_res,"00001"); lower_res(lower_res’high):= higher_res(higher_res’low); higher_res:=shiftR(higher_res,"00001"); end loop; end unsignedMulD; procedure signedMulD (op1, op2: IN bit_vector; lower_res, higher_res: INOUT bit_vector) is -- higher_res is P, lower_res/op1 is A, op2 is B variable last_lsb:bit; begin assert((op1’length=op2’length) and (op1’length=higher_res’length) and (op1’length=lower_res’length)) report "Mismatch IN operand sizes" severity error; lower_res:= op1; last_lsb:= ’0’; higher_res:= to_bitvector(0, higher_res’length); for i IN op1’range loop if (lower_res(lower_res’low) = ’0’) and (last_lsb=’0’) then higher_res:= higher_res + to_bitvector(0,higher_res’length); elsif (lower_res(lower_res’low) = ’0’) and (last_lsb=’1’) then higher_res:= higher_res + op2; elsif (lower_res(lower_res’low) = ’1’) and (last_lsb=’0’) then higher_res:= higher_res - op2; else higher_res:= higher_res + to_bitvector(0,higher_res’length); end if; flast_lsb:= lower_res(lower_res’low); lower_res:= shiftR(lower_res,"00001"); lower_res(lower_res’high):= higher_res(higher_res’low); higher_res:=signedShiftR(higher_res,"00001"); end loop; end signedMulD; procedure example(variable g0: INOUT bit_vector (31 downto 0);
Appendix D: VHDL Version of the Case Study variable g1: INOUT bit_vector (31 downto 0); variable g2: INOUT bit_vector (31 downto 0); variable g3: INOUT bit_vector (31 downto 0) ) is variable g4: bit_vector (31 downto 0); variable g5: bit_vector (31 downto 0); variable g6: bit_vector (31 downto 0); variable g7: bit_vector (31 downto 0); variable g13: bit_vector (31 downto 0); variable r4: bit_vector (31 downto 0); variable r5: bit_vector (31 downto 0); begin g1:= shiftL(g1, x"00000010"); g7:= x"00000000"; g1:= signedSHIFTR(g1, x"00000010"); g13:= x"00000000"; if signedGt(g1, g13) then g2:= shiftL(g2, x"00000010"); g3:= shiftL(g3, x"00000010"); r5:= g1; r4:= signedSHIFTR(g2, x"00000010"); g1:= signedSHIFTR(g3, x"00000010"); loop g4:= x"00000000"; if signedGt(r4, g4) then g5:= signedSHIFTR(g2, x"00000010"); loop g7:= x"00000001" + g7; g4:= x"00000001" + g4; exit when not signedGt(g5, g4); end loop; end if; g4:= x"00000000"; if signedGt(g1, g4) then g6:= signedSHIFTR(g3, x"00000010"); g5:= g0; loop memWr(b"10", g5, g7); g4:= x"00000001" + g4; exit when not signedGt(g6, g4); end loop; end if; g0:= x"00000002" + g0; g13:= x"00000001" + g13; exit when not signedGt(r5, g13); end loop; end if; end example; variable param0: bit_vector (31 DOWNTO 0); variable param1: bit_vector (31 DOWNTO 0);
    variable param2: bit_vector (31 DOWNTO 0);
    variable param3: bit_vector (31 DOWNTO 0);
    variable inst: bit_vector (0 DOWNTO 0);
  begin
    outer_loop: loop
      mem_req <= '0';
      reg_fin <= '0';
      wait until clk = '1';
      loop
        exit when copro_st = '0';
        wait until clk = '1';
      end loop;
      copro_dn <= '0';
      idle_access: loop
        reg_fin <= '0';
        loop
          wait until clk = '1';
          exit idle_access when copro_st = '1';
          exit when fpga_cs0 = '1';
        end loop;
        loop
          exit when fpga_rd = '1' or fpga_wr = '1';
          wait until clk = '1';
        end loop;
        case ain (4 downto 2) is
          when b"000" =>
            if fpga_wr = '1' then
              if den(0) = '1' then
                inst := din(0 DOWNTO 0);
              end if;
            end if;
          when b"001" =>
            if fpga_wr = '1' then
              if den(3) = '1' then
                param0(31 DOWNTO 24) := din(31 DOWNTO 24);
              end if;
              if den(2) = '1' then
                param0(23 DOWNTO 16) := din(23 DOWNTO 16);
              end if;
              if den(1) = '1' then
                param0(15 DOWNTO 8) := din(15 DOWNTO 8);
              end if;
              if den(0) = '1' then
                param0(7 DOWNTO 0) := din(7 DOWNTO 0);
              end if;
            end if;
          when b"010" =>
            if fpga_wr = '1' then
              if den(3) = '1' then
                param1(31 DOWNTO 24) := din(31 DOWNTO 24);
              end if;
              if den(2) = '1' then
                param1(23 DOWNTO 16) := din(23 DOWNTO 16);
              end if;
              if den(1) = '1' then
                param1(15 DOWNTO 8) := din(15 DOWNTO 8);
              end if;
              if den(0) = '1' then
                param1(7 DOWNTO 0) := din(7 DOWNTO 0);
              end if;
            end if;
          when b"011" =>
            if fpga_wr = '1' then
              if den(3) = '1' then
                param2(31 DOWNTO 24) := din(31 DOWNTO 24);
              end if;
              if den(2) = '1' then
                param2(23 DOWNTO 16) := din(23 DOWNTO 16);
              end if;
              if den(1) = '1' then
                param2(15 DOWNTO 8) := din(15 DOWNTO 8);
              end if;
              if den(0) = '1' then
                param2(7 DOWNTO 0) := din(7 DOWNTO 0);
              end if;
            end if;
          when b"100" =>
            if fpga_wr = '1' then
              if den(3) = '1' then
                param3(31 DOWNTO 24) := din(31 DOWNTO 24);
              end if;
              if den(2) = '1' then
                param3(23 DOWNTO 16) := din(23 DOWNTO 16);
              end if;
              if den(1) = '1' then
                param3(15 DOWNTO 8) := din(15 DOWNTO 8);
              end if;
              if den(0) = '1' then
                param3(7 DOWNTO 0) := din(7 DOWNTO 0);
              end if;
            end if;
          when others =>
            null;
        end case;
        reg_fin <= '1';
        wait until clk = '1';
        -- assume that fpga_cs0 goes inactive here
        reg_fin <= '0';
      end loop idle_access;
      wait_source(0);
      case inst is
        when b"0" =>
          example(param0, param1, param2, param3);
        when others =>
          null;
      end case;
      copro_dn <= '1';
    end loop outer_loop;
  end process;
end procedure_level;
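The two multiply procedures in the listing above, unsignedMulD (shift-and-add) and signedMulD (radix-2 Booth recoding on the pair lsb/last_lsb), can be checked against a small software model. The following is a minimal sketch in Python, not part of the original appendix. The names to_signed, unsigned_mul, signed_mul and combine are hypothetical. It also assumes that the partial product P keeps any carry or overflow bit until the shift; whether the `+` and `-` of the listing do so depends on the asyl.arith package, which is not shown here.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def unsigned_mul(a, b, n=32):
    """Shift-and-add multiply; returns (lower, higher) words of the 2n-bit product."""
    mask = (1 << n) - 1
    lower, higher = a & mask, 0          # lower is A, higher is P
    b &= mask                            # b is B
    for _ in range(n):
        if lower & 1:
            higher += b                  # assumption: the carry-out is kept
        # shift the pair (P, A) right by one bit: P's lsb enters A's msb
        lower = (lower >> 1) | ((higher & 1) << (n - 1))
        higher >>= 1
    return lower, higher


def signed_mul(a, b, n=32):
    """Radix-2 Booth multiply; returns (lower, higher) as n-bit patterns."""
    mask = (1 << n) - 1
    lower, higher, last_lsb = a & mask, 0, 0
    b = to_signed(b, n)
    for _ in range(n):
        lsb = lower & 1
        if (lsb, last_lsb) == (0, 1):
            higher += b                  # end of a run of ones: add B
        elif (lsb, last_lsb) == (1, 0):
            higher -= b                  # start of a run of ones: subtract B
        last_lsb = lsb
        lower = (lower >> 1) | ((higher & 1) << (n - 1))
        higher >>= 1                     # arithmetic shift on Python integers
    return lower, higher & mask


def combine(lower, higher, n, signed=False):
    """Join the two result words into one 2n-bit value."""
    value = (higher << n) | lower
    return to_signed(value, 2 * n) if signed else value


if __name__ == "__main__":
    print(combine(*unsigned_mul(3, 5, n=4), n=4))             # 15
    print(combine(*signed_mul(-3, 5, n=4), n=4, signed=True))  # -15
```

Because Python integers are unbounded, the model sidesteps the width question entirely; a fixed-width implementation would need one extra bit in P to give the same results for all operands.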
Version 2 of Example

----------
-- autogenerated with translation tool synth2
--   * number of adder resources = 2
--   * number of subtractor resources = 1
--   * fsm style = 0
--   * explicit resource allocation = off
--   * squash states = on
--   * full r/w registers = off
--   * constant propagation optimisation = on
--   * dead-code elimination optimisation = on

library asyl;
use asyl.arith.all;

entity xciu is
  port (
    clk: IN bit;
    reset: IN bit;
    copro_st: IN bit;
    copro_dn: OUT bit;
    fpga_rd: IN bit;
    fpga_wr: IN bit;
    fpga_cs0: IN bit;
    reg_fin: OUT bit;
    fpga_rw: OUT bit;
    mem_req: OUT bit;
    mem_fin: IN bit;
    fpga_size: OUT bit_vector(1 downto 0);
    ain: IN bit_vector(4 downto 2);
    aout: OUT bit_vector(23 downto 0);
    din: IN bit_vector(31 downto 0);
    dOUT: OUT bit_vector(31 downto 0);
    den: IN bit_vector(3 downto 0));
end xciu;

architecture rtl_level of xciu is
  signal present_state: integer range 0 to 24;
begin
  fsm: process(clk, reset)
    variable g0: bit_vector(31 downto 0);
    variable g1: bit_vector(31 downto 0);
    variable g2: bit_vector(31 downto 0);
    variable g3: bit_vector(31 downto 0);
    variable g4: bit_vector(31 downto 0);
    variable g5: bit_vector(31 downto 0);
    variable g6: bit_vector(31 downto 0);
    variable g7: bit_vector(31 downto 0);
    variable inst: bit_vector(7 downto 0);
    variable tmp: bit_vector(31 downto 0);
  begin
    if reset = '1' then
      present_state <= 0;
      copro_dn <= '1';
      reg_fin <= '0';
      fpga_rw <= '0';
      mem_req <= '0';
      fpga_size <= b"00";
      aout <= x"000000";
      dout <= x"00000000";
    elsif clk'event and clk = '1' then
      case present_state is
        when 0 =>
          if copro_st = '0' then
            present_state <= 1;
          else
            present_state <= 0;
          end if;
        when 1 =>
          copro_dn <= '0';
          if copro_st = '1' then
            present_state <= 2;
          elsif fpga_cs0 = '1' then
            present_state <= 3;
          else
            present_state <= 1;
          end if;
        when 2 =>
          if inst(0) = '0' then
            present_state <= 4;
          else
            present_state <= 2;
          end if;
        when 3 =>
          if fpga_rd = '1' then
            present_state <= 6;
          elsif fpga_wr = '1' then
            present_state <= 6;
          else
            present_state <= 3;
          end if;
        when 4 =>
          g6 := x"00000000";
          g7 := x"00000000";
          tmp := x"00000000" - g1;
          present_state <= 5;
        when 5 =>
          if tmp(31) = '0' then
            present_state <= 10;
          else
            present_state <= 8;
          end if;
        when 6 =>
          if ain = b"000" then
            if fpga_wr = '1' then
              if den(0) = '1' then inst(0) := din(0); end if;
            end if;
          end if;
          if ain = b"001" then
            if fpga_wr = '1' then
              if den(0) = '1' then g0(7 downto 0) := din(7 downto 0); end if;
            end if;
          end if;
          if ain = b"001" then
            if fpga_wr = '1' then
              if den(1) = '1' then g0(15 downto 8) := din(15 downto 8); end if;
            end if;
          end if;
          if ain = b"001" then
            if fpga_wr = '1' then
              if den(2) = '1' then g0(23 downto 16) := din(23 downto 16); end if;
            end if;
          end if;
          if ain = b"001" then
            if fpga_wr = '1' then
              if den(3) = '1' then g0(31 downto 24) := din(31 downto 24); end if;
            end if;
          end if;
          if ain = b"010" then
            if fpga_wr = '1' then
              if den(0) = '1' then g1(7 downto 0) := din(7 downto 0); end if;
            end if;
          end if;
          if ain = b"010" then
            if fpga_wr = '1' then
              if den(1) = '1' then g1(15 downto 8) := din(15 downto 8); end if;
            end if;
          end if;
          if ain = b"010" then
            if fpga_wr = '1' then
              if den(2) = '1' then g1(23 downto 16) := din(23 downto 16); end if;
            end if;
          end if;
          if ain = b"010" then
            if fpga_wr = '1' then
              if den(3) = '1' then g1(31 downto 24) := din(31 downto 24); end if;
            end if;
          end if;
          if ain = b"011" then
            if fpga_wr = '1' then
              if den(0) = '1' then g2(7 downto 0) := din(7 downto 0); end if;
            end if;
          end if;
          if ain = b"011" then
            if fpga_wr = '1' then
              if den(1) = '1' then g2(15 downto 8) := din(15 downto 8); end if;
            end if;
          end if;
          if ain = b"011" then
            if fpga_wr = '1' then
              if den(2) = '1' then g2(23 downto 16) := din(23 downto 16); end if;
            end if;
          end if;
          if ain = b"011" then
            if fpga_wr = '1' then
              if den(3) = '1' then g2(31 downto 24) := din(31 downto 24); end if;
            end if;
          end if;
          if ain = b"100" then
            if fpga_wr = '1' then
              if den(0) = '1' then g3(7 downto 0) := din(7 downto 0); end if;
            end if;
          end if;
          if ain = b"100" then
            if fpga_wr = '1' then
              if den(1) = '1' then g3(15 downto 8) := din(15 downto 8); end if;
            end if;
          end if;
          if ain = b"100" then
            if fpga_wr = '1' then
              if den(2) = '1' then g3(23 downto 16) := din(23 downto 16); end if;
            end if;
          end if;
          if ain = b"100" then
            if fpga_wr = '1' then
              if den(3) = '1' then g3(31 downto 24) := din(31 downto 24); end if;
            end if;
          end if;
          reg_fin <= '1';
          present_state <= 7;
        when 7 =>
          reg_fin <= '0';
          present_state <= 1;
        when 8 =>
          g4 := x"00000000";
          tmp := x"00000000" - g2;
          present_state <= 9;
        when 9 =>
          if tmp(31) = '0' then
            present_state <= 14;
          else
            present_state <= 11;
          end if;
        when 10 =>
          copro_dn <= '1';
          present_state <= 0;
        when 11 =>
          g6 := x"00000001" + g6;
          g4 := x"00000001" + g4;
          present_state <= 12;
        when 12 =>
          tmp := g4 - g2;
          present_state <= 13;
        when 13 =>
          if tmp(31) = '1' then
            present_state <= 11;
          else
            present_state <= 14;
          end if;
        when 14 =>
          g4 := x"00000000";
          tmp := x"00000000" - g3;
          present_state <= 15;
        when 15 =>
          if tmp(31) = '0' then
            present_state <= 17;
          else
            present_state <= 16;
          end if;
        when 16 =>
          g5 := g0;
          present_state <= 20;
        when 17 =>
          g0 := x"00000002" + g0;
          g7 := x"00000001" + g7;
          present_state <= 18;
        when 18 =>
          tmp := g7 - g1;
          present_state <= 19;
        when 19 =>
          if tmp(31) = '1' then
            present_state <= 8;
          else
            present_state <= 10;
          end if;
        when 20 =>
          dOUT(15 downto 0) <= g6(15 downto 0);
          aout <= g5(23 downto 0);
          fpga_size <= b"10";
          fpga_rw <= '0';
          mem_req <= '1';
          present_state <= 21;
        when 21 =>
          mem_req <= '0';
          if mem_fin = '1' then
            present_state <= 22;
          else
            present_state <= 21;
          end if;
        when 22 =>
          g4 := x"00000001" + g4;
          present_state <= 23;
        when 23 =>
          tmp := g4 - g3;
          present_state <= 24;
        when 24 =>
          if tmp(31) = '1' then
            present_state <= 20;
          else
            present_state <= 17;
          end if;
      end case;
    end if;
  end process fsm;
end rtl_level;
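The computational part of the state machine above (states 4 to 24) replaces every signed comparison by a subtraction into tmp followed by a test of the sign bit tmp(31). The following is a minimal sketch in Python, not part of the original appendix, that steps through those states and records the 16-bit memory writes. The names run_fsm and reference_loops are hypothetical, the register-loading states 0 to 3, 6 and 7 are left out, and the memory is assumed to assert mem_fin on the first cycle of state 21.

```python
MASK32 = 0xFFFFFFFF


def run_fsm(g0, g1, g2, g3):
    """Step through states 4..24; return the list of (address, halfword) writes."""
    g4 = g5 = g6 = g7 = tmp = 0
    writes = []
    state = 4
    while state != 10:                       # state 10 raises copro_dn (done)
        if state == 4:
            g6, g7, tmp, state = 0, 0, (0 - g1) & MASK32, 5
        elif state == 5:                     # 0 - g1 not negative: g1 <= 0
            state = 10 if tmp >> 31 == 0 else 8
        elif state == 8:
            g4, tmp, state = 0, (0 - g2) & MASK32, 9
        elif state == 9:
            state = 14 if tmp >> 31 == 0 else 11
        elif state == 11:
            g6, g4, state = (g6 + 1) & MASK32, (g4 + 1) & MASK32, 12
        elif state == 12:
            tmp, state = (g4 - g2) & MASK32, 13
        elif state == 13:                    # g4 < g2: keep counting
            state = 11 if tmp >> 31 == 1 else 14
        elif state == 14:
            g4, tmp, state = 0, (0 - g3) & MASK32, 15
        elif state == 15:
            state = 17 if tmp >> 31 == 0 else 16
        elif state == 16:
            g5, state = g0, 20
        elif state == 17:
            g0, g7, state = (g0 + 2) & MASK32, (g7 + 1) & MASK32, 18
        elif state == 18:
            tmp, state = (g7 - g1) & MASK32, 19
        elif state == 19:                    # g7 < g1: next outer iteration
            state = 8 if tmp >> 31 == 1 else 10
        elif state == 20:                    # aout is 24 bits, dOUT 16 bits
            writes.append((g5 & 0xFFFFFF, g6 & 0xFFFF))
            state = 21
        elif state == 21:                    # assumption: mem_fin is immediate
            state = 22
        elif state == 22:
            g4, state = (g4 + 1) & MASK32, 23
        elif state == 23:
            tmp, state = (g4 - g3) & MASK32, 24
        elif state == 24:
            state = 20 if tmp >> 31 == 1 else 17
    return writes


def reference_loops(g0, g1, g2, g3):
    """Nested-loop reading of the same states, for small non-negative inputs."""
    writes, total = [], 0
    for _ in range(g1):
        total += g2
        writes.extend([(g0 & 0xFFFFFF, total & 0xFFFF)] * g3)
        g0 += 2
    return writes


if __name__ == "__main__":
    print(run_fsm(0x100, 2, 3, 2) == reference_loops(0x100, 2, 3, 2))  # True
```

The sign-bit test is only equivalent to a signed comparison while the difference does not overflow 32 bits, which holds for the small loop counts used here.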
Index
A
Abstraction, 2, 9, 20 Acceleration, 32 Accelerator, 45 Accumulator, 19 Acknowledge cycle, 39–40, 46–48 Am2130, 108 Amdahl’s law, 30 AMICAL, 20 Analyzer, 55 AND/OR arrays, 18 Arbitration, 40, 48, 107–109, 119–124 Architecture, 55–58, 60 ASIC, 1, 3–5, 10–xi, 17–18, 20–21, 23, 25–26 ASYL+, 21, 35 AT bus, 24 Average time, 97, 104, 106
B
BDM, 36, 51 Behavior, 53, 77 Behavioral, 2–4, 6, 10, 15, 20, 22–23 Benchmark, 50, 105, 159, 173–174 Block frame, 135–137 Block replacement, 177 Block size, 141, 153, 156, 158–164, 169–170 Block update, xiii, 157, 161, 165 Board, 36–37, 45 Bottleneck, 53 Buffer, xii, 59–60, 69, 71, 74, 76, 78 data, xii, 59–60, 69, 71, 74, 76, 78 output, 76 Bus, 20, 24–25, 36–39 Bus arbiter, 58, 60–61, 65–69 Bus arbitration, 40–41, 49 Bus contention, 44, 46–48, 107, 120, 130 Bus grant, 65–68
Bus asynchronous bus, 139, 145 system data bus, 76
C
C, 12–14, 14, 22, 29–33, 36–37, 42, 45–47 C++, 2, 12, 14, 22 Cache block, xii, 136–137, 154, 167 Cache hit, 136 Cache miss, xiii, 145, 151–155 Cache size, 138, 140–141, 158, 160 CAD, 5, 18, 54 Cadence, 20 CAE, 18 CALLAS, 20 CFB, 19–20 Channel, 25, 38 Characteristics, 7–8 Chip, 17–18, 26 Circuit, 1–3, 15–17, 23, 26 CLB, 18, 21 Clock divider, 45, 59–60 Clock frequency, 61 Clock generator, 74–75 CMOS, 18 Co-design, 2, 4–7, 9–xi, 14, 19, 21, 23, 26–27, 29–31, 36–37, 49–50, 53–54, 58, 68, 74, 77, 81, 85, 105, 133, 169–170, 173–174, 176–178, 181–182 Co-simulation, 11, 178 Co-synthesis, 29 Coarse-grain level, 54 CODES, 6 Communication, 1, 6–7, 12, 14–15, 20, 22–25, 38, 48, 54, 60, 176–177 Compilation, 31, 36, 50 Compiler, 16, 20, 22 Concurrent statement, 56
Concurrent system, 6 Connectivity, 18 Contention, 107–110, 120 Control flow, 35 Control register, 43–45, 48 Controller, 20, 24, 138–140, 142, 144–145, 156 dual-port, 24 Coprocessor, 36–37, 40–41, 43–50, 81–82, 85, 95, 98, 100, 104–105, 107, 111, 113 Correctness, 2, 5 COSMOS, 20 COSYMA, 11, 15 Critical region, 13, 22, 29–33, 35, 37, 42, 45, 49, 74, 173, 177 CWAVES, 77, 88 Cycle, 110, 113, 118–120
D
Data flow, 76 Decode function, 47 Decoder, 22 Design cycle, 54 Design unit, 55 Deterministic, 31 Development cycle, 1 Device, 17–20, 22–23 Digital, 12, 20, 23, 26 Digital Signal Processor, 1 Digital system, 1 DMA, 19, 24 Domain-specific application, 5 DRAM, xii, 20, 25, 36, 40–42, 59–60, 72, 74, 111–112, 115–116, 118, 133, 175–176 DSP, 1, 5, 22–23
E
EDIF, 35 EGCHECK, 45, 47–48, 81, 173–174 Embedded controller, 22 Embedded systems, 1, 6–7 Entity, 55–58, 60 port, 55–56, 61 port declaration, 60 EPROM, 17, 19 Estimation, 32–33 Exception, 40–41 Execution, 3–4, 7 Execution time, 81, 85, 93–100, 102, 104, 121–122, 125–128, 130 Experimental result, 7
F
Falling edge, 116, 124 FIFO, 19, 22 Flexibility, 1, 3–4, 6 FLEXlogic, xi, 19–21, 36, 41, 45, 59 Flip-flop, 55, 57 Floating-point operation, 5 Flow graph, 14, 22 Flowchart, 60, 75–76 Formal method, 2, 6 FPGA, 1, 3, xv, 17–21, 25–27, 29–31, 33, 35–36, 41, 43, 45, 51, 59, 62, 75–76, 86, 176–177 FPLD, 17 Frequency, 30, 35, 45, 47–48 Function, 9, 22 Functional, 9–10, 15, 23 Functionality, 9, 13, 25, 54–55, 77
G
Gate array, 18, 26, 29 General-purpose application, 5
H
Handshake completion, 47–48, 85, 90, 92, 94–95, 100, 102, 105, 119–122, 125–128, 174–176, 178, 180–182 Handshake protocol, 44, 49 Hardware, 1–4, 9–16, 20, 22–26, 29–33, 35–36, 42–43, 45, 49–50, 54, 59, 82, 85, 108, 130 hardware function, 37, 43, 45, 49 sub-system, 6–7 Hardware/software, 9–11, 13, 15, 29, 46 HardwareC, 2, 12, 14, 20 Hardware sub-system, 29, 33, 42, 49 HDL, 54 High-level programming, 2 High-level specification, 2–3 Hit, 134 hit rate, 134
I
IEEE, 54 IFX780, xi, 19–21, 36, 41, 45 IMB, 38 Implementation, 9, 11, 13, 15, 22 Instruction cycle, 60–61 Instruction level, 2 Intel, xi, 19–21, 59
Interface, 4, 7–8, xi–xiii, 15, 19–25, 33, 36–38, 42–45, 47, 49–51, 53–55, 58–62, 69, 77, 81, 85, 92–95, 98–99, 102, 105, 107, 111, 114–115, 119–123, 126–128, 130, 138–140, 142, 144–146, 148, 150–152, 154, 156–157, 159–164, 166–167, 171, 173–179, 181–182 busy-wait, xi–xii, 42–48, 50, 53, 62–64, 81, 83, 85, 87–91, 94–98, 100, 103–106, 120–122, 125–130, 152, 157, 160–162, 166–167, 169–170, 173–175, 178, 180–182 interrupt, 42–43, 45–48, 50, 53, 63, 81, 85, 87, 90–91, 95–100, 104–106, 120–123, 125–130, 152, 157, 160–163, 166, 168–170, 173–175, 178, 180–182 Internal operation, xv, 121–122, 125–130 Interrupt-driven applications, 32 Interrupt, 62–64, 79, 91–92, 120, 128, 145, 168, 180 IOB, 18 IPC, 22 Iterations, 82, 94
L
Latch, 22, 71–73 LEAPFROG, 77, 88 Library, 55, 58 Logic gate, 23, 26
M
Match, 136–137 MC68332, 45, 50, 85, 139, 141 MCU, 38–41, 44, 46–48 Memory, 1, 4, 7–8, xi, 13, 18, 20, 24–25, 32, 36–37, 39–40, 44–47, 58, 61, 72–73 Memory controller, 111, 113–114, 116, 123–124, 131 Memory refresh, 74, 79 Memory accesses, 53, 59, 77, 81, 85, 93, 134, 144 architecture, 105 cache, 8, 130, 133, 138, 140, 144, 148, 150–151, 155, 158, 163–167, 169–171 configuration, 8, 78, 107, 111, 118–120, 128–130 controller, 141, 144–146, 148 coprocessor, 8, 76, 95, 102, 107, 130, 133, 137–138, 142, 145, 148 cycle, 97, 104 DMA, 24 dual-port, 107–109, 111, 113–114, 118, 120, 128, 130, 133, 138, 170 hierarchy, 8, 133–135 mapped, 23, 25, 61, 81–82, 85–87 read/write, 60–63, 65–69, 71–72, 82, 85–86, 94, 139 request, 102–103, 114, 142 shared, 8, 81–82, 87, 105, 107, 111–112, 118, 129–130, 133–135, 150, 169 single-port, 8, 108, 119–120, 128–130, 133, 138, 166 static memory, 175 Mentor Graphics, 19–20 Methodology, 2, 5–7, 9–11, 27, 29–30 Microcomputer, 29 Microcontroller, 2, 22–23, 29–30, 36–38, 40–41, 43–46, 48, 50, 58, 60–61, 63–66, 68–69, 71, 74–76, 79 Microelectronics, 1 Microprocessor, 25–26 Minc, 20 Miss rate, 134, 160, 162–165, 167–170, 177 Modeling, 54, 59 Modifiability, 6 Module, 130 Monitoring, 41 Motorola, 36 Multi-level specification, 54
N
Netlist, 20–21, 31, 35, 54, 59, 74
O
Off-chip, 18, 20 OLYMPUS, 11, 20–21, 24 On-chip, 18, 38, 108–109 Operating system, 7 Operation, 109–110, 113, 124 OrCAD, 19–20 Organization, 135–137 Overhead, 44, 46, 48 Overlap, 139
P
Package, 55 PALASM, 59–60, 77 Parallelism, 7 Parameter, 113, 121–122, 127, 129 Partition, 10, 13–15 Partitioning, 1, 4, 10, 13–15, 22–23, 30, 33, 35–36, 45–46, 50, 59, 82, 176–178 Performance, 1, 3–8, 29, 31, 42, 44–50, 81, 93, 105, 107, 121, 127, 129–130 Physical experiment, 53 Physical implementation, 3, 7, 64, 75, 77 PIT, 45–48 PLD, 18, 21, 28
PLDshell, 20 PLS, 21, 35 PLUM, 45–47, 81, 173–174 POLIS, 11, 15, 22–23 Port, 108–112, 119–120 POSEIDON, 23 Principle of locality, 177 PROCESS, 57 Profiling, 13, 15, 22, 30–33, 36, 50, 177 Program, 1–2, 5–6 Program diagnostics, 36 Programmability, 1 Programmable logic, 18 PROMELA, 12, 22 Protocol, 12, 15, 22–23, 33, 36, 38, 40, 43–44 Prototyping, 6, 29 PTOLEMY, 10–12, 15, 21–23, 178
Q
QSM, 38
R
RAM, 17–18, 26, 41, 108, 133 Read cycle, 139 Reconfigurable, 25–26 Recursive search, 1 Refresh, 113 Register-transfer, 16, 20 Register-transfer level, 54 Register, 107, 113, 118–120 Replacement rule, 136 Resource, 54 RISC, 25 RTL, 54
S
Schematic, 19 SDF, 12, 15, 21–23, 28 SDL, 2, 12, 20 Sensitivity list, 57, 64, 74–76 Sequencer, 58, 60–64, 69 Serialization, 4 SILAGE, 15 Simulation, 2, 6–8, 10–11, 14–15, 20, 23, 81, 88, 105, 108, 113, 121, 130 mixed-domain, 23 Simulator, 23, 88 Software, 1, 3–7, 9–16, 20–23, 25–26, 29–30, 32, 36, 44–45, 47–49, 64 sub-system, 6–7, 36–37, 42, 45, 49 Specification, 9–16, 20–23
Speedup, 30, 42, 46–47, 49–50 SPI, 36 SPLASH, 27 SRAM, 18, 20, 36, 40, 138 State machine, xi, 60, 66–72, 74, 112–114, 116 asynchronous, 112 synchronous, 114 Statistics, 31–32 Structural, 2, 6, 20, 24 Synchronization, 5, 7, 106 Synthesis, 2–7, 9–10, xi, 13, 15–16, 20–24, 30–31, 33–36, 45–46, 50, 54, 59, 82, 176–177 Synthesizer, 55 System clock, 61, 64, 66, 68–69, 71–72, 75 System functionality, 4 System level, 2
T
Target architecture, 7, 19, 25, 33, 37, 49–50, 173, 176 Target environment, 32 Testbed, 26 Thor domain, 23 Thor simulator, 23 Time stamps, 32 Timing characteristics, 85, 91–92 TPU, 38, 41 Trade-off, 134, 176 Translation, 30, 32 Tree data structure, 1
U
UMIST, 5, 7
V
Verification, 2 VHDL, 2, 7–8, xi, 12, 20–21, 33–35, 50, 53–55, 57–62, 74, 77, 81–83, 85, 88, 131, 145, 148, 150, 174–176, 178–180 Viewlogic, 19–20 VOTAN, 34–35
W
Wave viewer, 77 WORK, 55, 58 Working design library, 55 Write back, 137–138, 154, 161–162 Write cycle, 139 Write through, 137
X
X4010, 36 XACT, 19, 35, 45 XC4000, xv, 18–19 XC40xx, 33, 35 Xilinx, 18–19, 21, 25, 33, 35–36, 43, 45, 51, 59, 74–75, 86 XNF, 19, 35, 51, 59, 74