Platform Based Design at the Electronic System Level: Industry Perspectives and Experiences
Edited by
Mark Burton GreenSocs Ltd, France
Adam Morawiec ECSI, Grenoble, France
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10 1-4020-5137-9 (HB)
ISBN-13 978-1-4020-5137-1 (HB)
ISBN-10 1-4020-5138-7 (e-book)
ISBN-13 978-1-4020-5138-8 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved
© 2006 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
CONTENTS
Foreword: Enabling Platform-Based Design
    Grant Martin
1. The Need for Standards
    Mark Burton and Adam Morawiec
2. Programmable Platform Characterization for System Level Performance Analysis
    Douglas Densmore, Adam Donlin, and Alberto Sangiovanni-Vincentelli
3. Use of SystemC Modelling in Creation and Use of an SOC Platform: Experiences and Lessons Learnt from OMAP-2
    James Aldis
4. What’s Next for Transaction Level Models: The Standardization and Deployment Era
    Laurent Maillet-Contoz
5. The Configurable Processor View of Platform Provision
    Grant Martin
6. Peripheral Modeling for Platform Driven ESL Design
    Tim Kogel
7. Quantitative Embedded System Architecture and Performance Analysis
    Graham R. Hellestrand
FOREWORD: ENABLING PLATFORM-BASED DESIGN
My first exposure to the concept of Platform-Based Design (PBD) of integrated circuits, more particularly of the emerging “System-on-Chip” or SoC devices, was at lunch in the Cadence cafeteria in the mid-1990s, when a colleague who had worked for VLSI Technology, Andy McNelly, explained their concept of a design “platform”. Among those who were most influential in advancing this concept at VLSI Technology was Bob Payne, who coined the still memorable phrase “deconfiguration is easier than configuration”, in describing his concept of Rapid Silicon Prototyping (RSP). Indeed, VLSI’s leadership in what came to be called “Platform-Based Design” was one key reason for them being bought by Philips in 1999.

At that time, platform-based design was an interesting concept that was far from being universally accepted as a key way to do SoC design. Indeed, the concept of SoC design, and the reuse of intellectual property (IP) in the design of complex chips, was itself just in its infancy. Many designers and observers of the design scene put their faith in other approaches such as block-based design: ad-hoc IP assembly of new and reused design blocks as the situation demanded, for integrated circuits that were designed new each time, with little sharing or reuse of underlying architectures.

But there was something just too compelling about the Platform-Based Design concept to ignore. Indeed, this whole idea of a design process promised four important advantages over the more conventional top-down and block-based design methodologies then in widespread use:
1. It offered a process for IC implementation that could become regularised, standardised, highly productive and with much lower risk. Basing designs on an “integration platform architecture” with libraries of pre-proven IP blocks was a powerful method to bring order to the chaos of the normal custom design-driven IC design process.
2. It brought order and methodology to the emerging field of IP reuse.
Blocks could be designed for reuse in a known family of architectures, with more predictable interfaces, rather than being designed to fit in everywhere, usually by risky ad-hoc interface design for each new chip. Platform-Based Design promised to be the best enabler of IP reuse.
3. It gave a level playing field, or sandbox, in which hardware designers and software developers could interact. Already in the mid-1990s we saw the development of processor-based platforms: cellphone baseband chips, for example, used a combination of RISC, DSP, hardware blocks, memories and buses as the basic architecture. Platform-based design offered better thought-out and
controlled hardware variability, giving the software people less to worry about in their hardware-dependent firmware. And moving processor choice from freeform to a well thought-out evolution gave the hardware people less novelty with each new design, lowering risk and enhancing productivity.
4. Finally, it promised to evolve up towards new system-level design abstractions that would offer much greater productivity and an ability to design derivatives from platforms with much greater speed and lower risk than the classical RTL and C abstractions of existing design methods.

Although system-level design existed before platform-based design, the marriage of the two seemed like a marriage made in heaven. Advocating Platform-Based Design as a key methodology for SoC began to seem less like a wild and crazy new adventure, and more like “good old Northern common sense”. Although there were skeptics among the industry analysts, mere observation indicated that more and more design teams and IC design companies were beginning to explore this methodology and adapt and adopt it to their particular product design philosophies. It was the basis for the SoC design methodology that Cadence developed for the Alba project in Scotland. It first seemed like a good way to “survive” the SoC revolution, and then seemed more and more like the best way to “win” this transition in the industry.

Now, a decade later, what is the current situation in complex SoC design? First, it seems clear that Platform-Based Design is a well-accepted methodology; even the skeptics have hopped aboard the train. If one looks out in the industry, one can see many, many examples of this approach used by different companies and teams. Philips Nexperia, ST Nomadik, TI OMAP, Xilinx Virtex II/IV, Altera SOPC, Infineon, Freescale, Samsung, Toshiba, Sony, and many, many more chips, design groups and companies are all out there designing and using platforms successfully.
From the 2006 viewpoint, then, what remains to be done? We can see evidence that the first three promises of Platform-Based Design have been met:
1. We see that IC design processes have become much more regularised. Indeed, we see design service companies offering IC implementation services as partners to global design companies with high-level handoffs and high success rates.
2. We see substantial, extensive and growing IP reuse. From processors of all kinds to memories, from buses to hardware blocks, from device drivers to middleware to software applications, hardware and software IP reuse has flourished. It may not have evolved in the directions first envisaged by some of the early dreamers – we don’t see IP being designed in every garage in Silicon Valley by independent contractors, for example – but it flourishes nevertheless.
3. We see much more processor-centric design. Instruction set processors, whether fixed or configurable and extensible, are the new lingua franca of complex electronic product design – the place where complex functionality is defined and realized. Creating architectures where processors can flourish and prosper is the true mission of hardware design.

But we’re missing something: what about system level design (SLD) and the move to higher design abstractions? Has this occurred? Are all complex platforms
modeled at an abstract level? Do all derivative products get designed and verified at a high level so that the rest of the design flow becomes “mere implementation” or “an exercise left to the readers”? I think we can say no to these questions today, but offer a cautious maybe as the answer a year or two from now.

Despite being a hypothetical marriage made in heaven, the union of PBD and SLD, to become the PBSLD family, seems to have been endlessly protracted. In particular, one can say that SLD (or “Electronic System Level” – ESL design, as it has been renamed) has been “always the bridesmaid, never the bride”. Early efforts were made by busy matchmakers to bring this couple together, but they did not succeed. No doubt there were many reasons for this failure. Among them was surely the level of complexity of early platforms – they were not complex enough that the use of SLD methods was essential to design them. Rather, traditional methods of muddling through were good enough. Early baseband processors in cellphones may have had both a RISC and DSP – but these were relatively isolated subsystems with relatively simple interactions, and the software load was (relatively) small. Because the interactions were simpler and the amount of software smaller, the verification task was more tractable using simpler and traditional design methods.

Early SLD tools of the late 1990s and early in the new millennium also lacked standard specification languages and model interoperability, which discouraged models from being created, since they would need to be redone in each new proprietary language or format that came along. And every company or design group had a different notion of what abstractions were appropriate. Transaction level modeling started to be advocated, but in its early days, “it all depended on what your definition of a transaction was”. So, for many good reasons the PBD-SLD marriage was not necessary then. Why is it a good idea now, in 2006?
Several reasons come to mind:
1. Designs are much more complex. Leading SoCs have many processors – often 3, 4 or more, as well as many memories, complex bus and memory access hierarchies, many hardware accelerators and peripheral blocks and many external interfaces. Verifying designs of this complexity with traditional methods has failed.
2. Software content of advanced SoCs has grown enormously. You don’t put up to 10 processors or more on an SoC with many large memory blocks without wanting to fill that memory up with software. Software is increasingly the medium of choice for delivering advanced functionality to users. And you don’t verify software of hundreds of thousands or millions of lines of code complexity using traditional RTL or cycle-accurate ISS models alone.
3. Despite fits and starts, EDA industry politics and competing languages, and the opacity of its steering body, a system level language standard has emerged, with the ratification of SystemC by the IEEE as IEEE standard 1666–2005, and the publication of its Language Reference Manual at the end of March, 2006. Mirabilis of mirabilis, we have a standard system level modelling language – and a pretty fair shake at one, too.
4. We are also stumbling or slouching towards an industry consensus on design abstraction levels as well. The notion of Transaction Level Modelling, advocated
for several years, may, within the year 2006, become a common definition among the disparate community of system modelers, design companies, IP companies and language experts. The mills have been grinding slowly on this one, but one hopes they have been grinding towards an exceedingly fine result.

Always an optimist even in my blackest moods, I would like to feel that the final roadblocks to the marriage of PBD and (now) ESL are being dismantled. And that leads us to the subject of this book.

This book crystallizes a snapshot in the evolution towards the standard modelling notions and design abstractions that will make high-productivity, low-risk, ESL-based Platform-Based Design possible. It has been written by experts from a variety of perspectives (note to the gentle reader: I wrote a chapter, but would demur from the label “expert”). These include ESL tools companies, modelling companies, IP companies, semiconductor platform providers, independent consultants and academia. Naturally, as with any complex subject that has not quite reached a complete consensus, there are different notions and fine shadings and colouring on the standards, methods, and abstractions necessary for this emerging methodology. But the authors agree on several key points:
1. the need for standards in modelling and to enable model interoperability.
2. the idea of the Transaction as a key enabling abstraction level and enabler of the ESL methodology for PBD.
3. the need to give software writers a better and more realistic model of the target platform, and to try to offer them the right tradeoff options of speed vs. accuracy (speed being paramount).
4. the centrality of processor-based design and the corollaries of buses and other interconnect structures as being a key part of the SoC architecture.
5. the search for even higher level abstractions than the TLM model offers to allow platform providers to characterise their platforms in new ways, and designers to choose particular design tradeoffs in more efficient ways.

With agreement converging or near on these key points, the vision of interoperable TLM models and higher level software prototyping models seems much closer to realisation. 2006 seems likely to become an annus mirabilis after all. I commend this book to the reader as the best way to dip their toes into the murky waters formed by the confluence of Platform-Based and Electronic System Level Design. Necessity may have forced the Shotgun Marriage of these two Partners, but Let Us Hope it is a Fruitful One, and gives Birth to Many New Opportunities.

Grant Martin
Santa Clara, California
April, 2006
CHAPTER 1 THE NEED FOR STANDARDS
MARK BURTON (1) AND ADAM MORAWIEC (2)
(1) GreenSocs Ltd., www.greensocs.com
(2) ECSI, www.ecsi.org
1. CONTEXT AND SCOPE OF THE WORK
This book presents a multi-faceted view of the set of problems that the electronic industry currently faces in the development and integration of complex heterogeneous systems (including both hardware and software components). It analyses and proposes solutions related to the provision of integration platforms by SoC and IP providers in light of the needs and requirements expressed by the system companies: they are the users of such platforms, which they apply to develop their next generation products. Further, the book tries to draw a comprehensive picture of the current “interfaces” between the platform providers and users, defined by technical requirements, current design methodology and flows, standards, and finally by the business context and relationships (which should not be underestimated). These producer-consumer, shared “interfaces” enable (or should enable) the exchange of a well-understood and complete set of data between both parties to ensure design efficiency, high productivity and best use of domain-specific expertise and knowledge.

The problems to be solved are related to modelling of platform functionality and performance (formalisms, methods, metrics), interoperability of models, architecture exploration, early SW development in parallel to the HW platform instantiation, verification and debugging methods and flows, management of complexity at various abstraction levels, and the implications of the trade-offs between the accuracy and complexity of models.

The solutions discussed by the contributors to this book have one common denominator: these are standards. In the general sense, the book provides views on why and what kind of standards are the prerequisite to the deployment of a platform-based design ecosystem, in which cooperation is made possible between all parties involved in system development: system houses, platform and IP providers and EDA companies.

M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 1–11. © 2006 Springer.

2. OVERVIEW OF CONTRIBUTIONS
The complex problem of Electronic System Level design is presented in the chapters of the book from various perspectives. First, Densmore, Donlin and Sangiovanni-Vincentelli present in Chapter 2 an elegant overview of the design tasks associated with the Electronic System Level (see their Figure 2, Transcending RTL effort with ESL design technologies). They split the design task into essentially three main activities:
1. Early software development.
2. Functional verification and executable specification.
3. System-wide performance analysis and exploration.

What is interesting about this list is that while it is split up by different people in different ways, the core elements are invariably the same. The split becomes interesting as different companies have taken different approaches to minimising the number of models they need to write. Hence some try to service all the activities with one model, while others use targeted models for each activity. They go on to focus on performance analysis, presenting a formal way in which they have characterised system performance.

Addressing the same issue, Aldis presents in Chapter 3 a simulation framework that concentrates totally on the communication fabric. Again, he characterises the components of the system, or uses sampled data, in order to drive traffic generation on the communication infrastructure. The interest of this approach is that he is able to build complete system models of the OMAP platform in response to user requests in a very timely manner. The “modelling cost” is relatively low. His model only focuses on one aspect of design activity, but does it effectively. This style of modelling could be characterised as being just about the timing (often abbreviated as T).

In Chapter 4, Maillet-Contoz addresses software development and functional verification aspects of the design activity with somewhat more costly models. His models could be characterised as being just about the Programmer’s View (PV) of a system.
He takes the standpoint that if the programmer cannot observe an effect, then it need not be modelled. Indeed, somewhat stronger than this, if the programmer is not permitted to rely on an effect, then it may be modelled in whatever way suits the model best. This opens up an area not covered by the submissions to this book: namely, the prospect that models of system performance can be more useful than real hardware implementations. The case of a programmer not being permitted to make use of a hardware effect, but doing so nonetheless, is all too common. Such hardware-related “bugs” are often invisible until the underlying hardware is changed while the software is intended to remain the same. The worst case of this is where the hardware feature becomes non-deterministic, and the bug can then remain hidden for hours of runtime. Models can play an important role here in providing platforms
that either check that the software is only relying on permissible features of the hardware, or “strain” the software so that it fails if it is not. An example of this would be a cache that either caches nothing or everything.

While several contributions to this book have looked at how to evaluate performance, Martin describes in Chapter 5 some of the options that can be chosen to better achieve the required performance. In his case, this is very much focused on the processor core itself; however, it is apparent from all the contributions, and it is a theme that will be returned to, that platforms are not monolithic: to be successful, they have to be configurable. In turn, Kogel shows in his contribution (Chapter 6) how a consistent methodology can help in creating peripheral models that can be plugged into and out of a platform. This is essentially a part of the same platform configuration story. Kogel’s work focuses on the simulation activity: to continue the theme above, he addresses a combination of Programmer’s View and Timing (PVT). In Chapter 7, Hellestrand shows how important this level of modelling is to achieve accurate simulation results for critical systems. He also points out how important it is to be able to achieve these results quickly, both in terms of the speed of writing a model and the speed of execution of the model, since fully characterising a system typically takes a large number of scenarios and requires a lot of code to be executed.

Throughout this book, the authors have repeatedly returned to the subject of standards. In this brief summary, we will give an impression of why this is such an important issue in the field of platform-based design. Also, some of the aspects of standards bodies that help standards to be created in a timely and effective manner will be discussed.

3. STANDARDS IN PBD?
If we stand back for a moment, it seems strange that standards are interesting at all for platform-based design. Indeed, one would expect that fewer, rather than more, standards might be involved as the platforms themselves “envelop” more and more of the design space. Designers of IP blocks previously had to conform to standards on their peripheries, but, if the IP block is the platform, then the internal structure should not matter.

This is anything but the truth. In reality, platform-based design is not about all-consuming platforms; rather, it is about the ability to construct customised and tailored designs quickly, based on a common kit of parts. The debate about whether there will be one or many platforms misses the key issue. Each manufacturer may only support a single “platform”, but the benefit of such a platform will be its ability to be highly configurable and customisable to individual applications.

In order for this to become reality, standards become more important than ever. Now, IP blocks or entire sub-systems will be quickly (re)used in combinations that the original designers may not have thought about. It becomes important not only to have interconnectivity standards, but standards that support
and expand the interconnectivity with the Quality of Service (QoS) in terms of bandwidth use, power, responsiveness etc. The second order effects of using IP blocks in previously un-imagined combinations become the most complicated to control.

Modelling and validation are answers. However, again, there is a need for standards. Not only must models be interconnected, but they also need to be controlled, and data must be collected and analysed. The means by which all of this is done needs to be standardised.

Perhaps this is obvious. However, only some aspects of standardisation have even begun to be looked at by various, yet dispersed, bodies. For example, the SPIRIT organisation is looking at the way in which IP blocks can be described such that they can be interconnected (currently at least at the RTL level). The OSCI TLM Working Group is developing mechanisms by which models can be connected together. But this is just a beginning. As an industry we need to become much better at constructing and running “standards factories” which will deliver the rich and diverse set of standards that we need. Today we are far from this ideal. Then, a similar effort should be invested to make sure these standards are accepted and deployed in product development flows to ensure that they are finally successful.
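The three concerns named above (interconnecting models, controlling them, and collecting data from them) can be illustrated with a small sketch. It is written in plain Python rather than SystemC, and every name in it (`Model`, `transport`, `reset`, `stats`, `Memory`, `Bus`) is a hypothetical chosen for illustration; none is taken from OSCI, SPIRIT or any other published standard. The point is only that once models share one agreed interface, they can be combined in ways their authors did not anticipate.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class Model(ABC):
    """Hypothetical common interface; not taken from any published standard."""

    @abstractmethod
    def transport(self, address: int, data: Optional[int] = None) -> int:
        """Interconnection: write `data` if given, otherwise read."""

    @abstractmethod
    def reset(self) -> None:
        """Control: return the model to its initial state."""

    @abstractmethod
    def stats(self) -> Dict[str, int]:
        """Data collection: report counters for later analysis."""


class Memory(Model):
    """A trivial memory model that counts its reads and writes."""

    def __init__(self) -> None:
        self.reset()

    def transport(self, address: int, data: Optional[int] = None) -> int:
        if data is None:
            self.reads += 1
            return self.cells.get(address, 0)
        self.writes += 1
        self.cells[address] = data
        return data

    def reset(self) -> None:
        self.cells: Dict[int, int] = {}
        self.reads = 0
        self.writes = 0

    def stats(self) -> Dict[str, int]:
        return {"reads": self.reads, "writes": self.writes}


class Bus(Model):
    """Routes each access to whichever model is mapped at that address."""

    def __init__(self) -> None:
        self.regions: List[Tuple[int, int, Model]] = []  # (base, size, model)
        self.accesses = 0

    def attach(self, base: int, size: int, model: Model) -> None:
        self.regions.append((base, size, model))

    def transport(self, address: int, data: Optional[int] = None) -> int:
        for base, size, model in self.regions:
            if base <= address < base + size:
                self.accesses += 1
                return model.transport(address - base, data)
        raise ValueError("no model mapped at address %#x" % address)

    def reset(self) -> None:
        self.accesses = 0
        for _, _, model in self.regions:
            model.reset()

    def stats(self) -> Dict[str, int]:
        return {"accesses": self.accesses}


bus = Bus()
ram = Memory()
bus.attach(0x8000, 0x1000, ram)
bus.transport(0x8004, 42)       # write through the bus
value = bus.transport(0x8004)   # read the same location back
```

Because `Bus` only relies on the shared interface, any other model that implements the same three methods could be attached in place of `Memory` without changing the bus. Agreeing on what those methods are is the kind of question a standard has to settle.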
4. THE STANDARDS LANDSCAPE FOR PLATFORM BASED DESIGN
Within the field of platform-based design, there are a few important standards bodies, each addressing a specific aspect of modelling, design or verification (specification and modelling languages, design representation, interfaces).
4.1 OSCI
This book concentrates on the “electronic system level design” aspects of platform-based design. That is, the specification, modelling and verification of platforms. At the heart of system models is SystemC. For some it is “just glue”; others use it as the modelling language. In any case, SystemC is increasingly important. The Open SystemC Initiative was set up as a consortium of interested parties, predominantly EDA companies that had existing and competing C or C++ modelling solutions. They realised that in order to gain acceptance in the user community, and hence to sell their tools, a standard language was required. OSCI’s mission has been to standardise the SystemC language, using the IEEE as the ratification body. This has now been achieved (IEEE 1666–2005), and OSCI’s principal role has now become publicising and promoting the language. There are still some “outstanding” issues that need to be addressed, for instance better support for Transaction Level Modelling (TLM), which is being worked on by the TLM Working Group. Meanwhile, the language itself is now “owned” by the IEEE. OSCI is a not-for-profit organisation that does not supply any engineering resource, nor does it demand that its members contribute effort.
4.2 Accellera
The development, updates and extensions of hardware description language (HDL) and other language-based design standards are the main objectives of the Accellera organisation. Created as a merger of VHDL International and Open Verilog International (OVI), Accellera maintains the existing HDLs, VHDL and Verilog, but also identifies and develops new standards like SystemVerilog, Open Verification Library and PSL (Property Specification Language). SystemVerilog moves into the domain of system development and provides powerful assertion mechanisms (SVA) for verification purposes. PSL has also gained a lot of attention for complete verification solutions that are based on formal (static) and traditional (dynamic simulation) methods.

4.3 SPIRIT
SPIRIT was set up to achieve one principal goal: to standardise an interchange format for IP blocks (at all levels of abstraction). The interchange format is XML based. The tools companies within the mixed consortium of SoC, IP and EDA companies (with 6 core members: ST, Philips, ARM and Cadence, Mentor and Synopsys) want to sell tools capable of building platforms from an IP library. Likewise the IP companies want to “stock” that library. This is absolutely the aim of platform-based design. SPIRIT is a consortium that offers free membership, and demands a certain level of commitment (engineering time) from its members. It is results-oriented and this approach seems to be very productive; however, the standards for higher abstraction levels seem to have taken longer than was expected, presumably due to a lack of the promised resources.

4.4 VSI Alliance
The VSIA is an open organization chartered with the development of SoC, IP and reuse standards to enhance the productivity of SoC design. Initially, VSIA focused mainly on lower level IP reuse standards for hard components (including various aspects of IP representation, verification, test, protection and quality). It also initiated a community-wide reflection on system-related issues and created the basis for interface-based design and on-chip bus standards, on which both SPIRIT and OCP-IP in effect built their activities. Currently, three principal domains are covered by the VSIA working groups (called Pillars): IP quality, IP protection and IP transfer.

4.5 OCP-IP
OCP-IP is an organisation that started with an existing proprietary standard (from Sonics, an interconnect IP company), with the intention of promoting and standardising this “socket”. The goal was that IP conforming to the socket would then
be suitable for integration in a platform. OCP-IP has been relatively successful. The standard is only open to its members (Community Source Software), and membership carries a relatively significant fee. There is a very small amount of support given to the standards groups within OCP, but in the main, the members are expected to contribute time. Because OCP-IP started from a “dictated” initial point, the organisation has not suffered from multiple implementations being contributed, and has remained relatively productive.
4.6 OMG
The Object Management Group (OMG) is an open membership, not-for-profit consortium established in the SW domain that provides the foundation for a multiplatform Model Driven Architecture (MDA) with associated modelling specifications including the Meta-Object Facility (MOF) and the Unified Modelling Language (UML). UML emerged as a multiple-view representation formalism in the object-oriented SW development domain. It allows the construction and analysis of several views of the system by means of structure diagrams (e.g. class, object, and component structure diagrams), behaviour diagrams (use case, activity and state diagrams) and interaction diagrams (e.g. sequence, communication, timing diagrams). Moreover, it enables the definition of domain specific extensions or specialisations via the profile mechanism. This is of particular interest in the context of platform-based design, as this capability helps to create a liaison between high-level system specification and implementation paths (e.g. transaction level). Several such specific profiles are undergoing the standardisation process within OMG: SysML (requirements and system specification), MARTE (modelling and analysis of real-time applications), UML for SystemC (platform specific modelling of HW parts) and UML for SoC (hierarchical structure of modules). This profile standardisation is an ongoing process, and is still far from complete in these cases. However, there is considerable support within the industry for this process for the SystemC related profiles, via an ECSI Study Group on UML for Embedded Systems and SoC that has brought a significant consolidation and focus of effort.
4.7 IEEE
The final organisation that must be mentioned is the IEEE. The IEEE ratifies standards across the Electronic Engineering industry. Of particular note, as mentioned above, they own SystemC, and now SystemVerilog (IEEE 1800–2005) and “e” (IEEE 1647). All are languages used in platform-based design. IEEE is a wide membership organisation with low fees; they provide little support for standards to be developed, but request the voluntary assistance of their members. The contributions to the IEEE are public to their members, and the process by which standards are ratified is very clear.
5. STANDARDS ORGANISATIONS
There are many standardisation organisations in the electronics industry; but in reality, the evidence is that none of them effectively delivers all the standards the industry needs. Further, there is no cohesion between the standards bodies. The unwritten, and written, complaints by the authors of this book and their identification of the need for standards are testament to that unhappy fact. Fundamentally, the industry is not producing standards at the rate that it itself requires. Over the next few sections of this chapter we will look more closely at the various factors involved in standardisation organisations and make some suggestions about how the process can be better streamlined. There are a number of key factors which distinguish different standards organisations. Most of these factors are interrelated.
5.1 Means of Ratification
The first important distinction between different standards bodies is the means of ratification. There are essentially three approaches taken to this ratification.

The standards body itself will publish the standard. Normally, the standards body will also “own” the standard, such that it is not “owned” by an industrial company who could then change the standard to exclude its competitors. A standards body is normally expected to be well known, and to have policies and procedures in place to ensure quality as well as its independence from any of the potential contributors to the standard. Examples of this include the IEEE.

In many cases, a “standards organisation” is set up to build a specific standard. Such bodies are not “well known” and hence often their results are submitted to a standards organisation for ratification. Examples of this include OSCI and SPIRIT.

Sometimes, organisations (companies, groups of individuals, academics etc.) create objects that just become widely used de facto standards. In many cases, there is little or no added benefit in a ratified standard. Examples of this include the gcc compiler chain that has come to “define” C++. In some ways, the de facto standard is very appealing; it seems to do without the tedious “ratification process” and often has powerful user support. However, in reality, while it may be a quicker route, there are hurdles to overcome. First, the donating organisation itself needs to provide a mechanism to ensure quality. Second, the organisation has to achieve user support. In doing so, such an organisation in effect becomes a full standards body. Boost is an example of such an organisation. It started providing de facto standards for C++ libraries, but now it has become well known, with a procedure in place to submit standards for ratification by Boost.

All of these ratification mechanisms are, of course, pointless unless the standard is used.
Finally, there is only one meaningful “ratification” and that is the use of the standard.
5.2
Organisation Structure
The second key differentiator for standards organisations is their legal structure. Again, there are a number of choices.
5.2.1
For profit, and not for profit
For organisations wishing to become a standards body, it is often considered important to be “not for profit”. The idea is that this distinguishes competitors that might provide competing implementations of the standard from the standards body itself. There would be a commercial benefit in owning the standard if the owner also provided an implementation, as they could exclude their competitors. The reality is that such standards are rare (though both Verilog and “e” started out in just this way). They are rare because competitors do not adopt them. Because they are not adopted, they are not used. It is only when the standard is compelling (which it was for both Verilog and “e”) that they are adopted. In this case, competing implementations, whether from the initial developer or not, are often forced to align by the users of the implementations. Interestingly, in the case of both “e” and Verilog, the languages have subsequently been donated to a standards body. But simply demanding not for profit status of a standards body is not good enough to prevent it from “competing”. OSCI’s SystemC simulator is perhaps the biggest competition for the providers of proprietary SystemC simulators. In the end, the key is whether the standard is used, whoever owns it. If it is used, then the owner will be kept “in check” by the users.
5.2.2
Founders, members, customers, partners, open
Often as a result of agreement between competitors, standards organisations are born, with some form of “membership”. Sometimes the membership is only open to those that set the organisation up. Others collect dues from members, at various levels. Some, like the IEEE, cover the costs of standards publication with widespread individual and corporate members. Others, like OSCI, use their industrial memberships to cover the cost of publicity. Finally, organisations such as Boost are open to anybody. Again, the key here is the degree to which the membership arrangements affect the use of the standard. It is not always the case that “open standards” will achieve greatest use. Sometimes demanding that organisations pay to use a standard can achieve widespread uptake (for instance OCP). This perhaps perverse effect seems to arise from the greater “buy-in” required of the member company by the standards body: since they have paid money, they want to see results.
5.3
Consensus Versus Autocracy
There are standards organisations that take both approaches to decision making. Consensus organisations are typical of “consortiums” of member companies. Autocratic standards are more common in open organisations (strangely). The norm for the Linux community, for instance, is that one person (or small group) provides a “gate keeping” service to others in a wider community. In other words, the “gate keeper” decides autocratically on the “standard”, while receiving suggestions from a wider group. Of course revolts are possible, and “competing” standards emerge (for a short time), until one becomes the de facto standard. One recent example was the interface to CD-ROM drives under Linux, which for many years was implemented with a SCSI interface. Linus Torvalds suggested that a simpler IDE style interface was possible, implemented the code, and provided an “alternative” standard which has now become the de facto standard.
5.4
Support/Manpower
Finally, in our view the most important aspect of a standards organisation is: who does the work? Standards are not simply created. There is a lot of complex work to be done. Above we have glimpsed the number of standards that need to be created; however this is carried out, it will take work.
5.4.1
Voluntary effort
Most often “open” organisations, such as Boost, leave contributions to be donated. The people working on the contributions to ensure quality all do so voluntarily. This is an appropriate approach when there is a large enough group of people, having similar interests, involved, but it is difficult to set a time-scale or a roadmap for a “standard” to emerge in this way.
5.4.2
Contributed effort
A common approach to better control the process is to require members to contribute some time or effort to an organisation. This essentially increases the dues to the organisation to include both financial and engineering support. The difficulty with this approach is that, while the member organisations may be willing to contribute this time, increasingly it is often left to a few individuals. More seriously for the standards organisation, at this point, the “not invented here (NIH)” syndrome takes effect. The “not invented here” syndrome is a serious disruptive force in all standards bodies. It is exacerbated if the members of the organisation are also those contributing the engineering effort, as clearly, their effort is reduced if the “standard” matches their own work. See section 4.5.
5.4.3
Provided effort
There are few standards bodies that provide engineering effort. OCP-IP provides some organisational assistance but no more. This is odd, as it may be the only logical way to unlock the NIH syndrome (in which individuals or organisations refuse to accept ideas, tools, or techniques simply because “they were not invented here”), and guarantee a reasonable velocity for the standards body. In effect those voluntary bodies that really do call on a stable and good sized number of contributors are the closest to standards bodies that provide resource. An example of this might be the Linux kernel community. It is our belief that to properly implement a standard, in a timely and efficient manner, requires the central provision of “independent” resources.
5.5
Psychology of Standards Bodies
Standards bodies have a psychology of their own, probably worthy of study. There are typically personal and organisational issues at play. Organisationally, the internal politics of an organisation can lead it to partake in a standards effort simply to “derail” the process and delay (or prevent) a standard emerging. This catastrophic conflict of interest apart, there are invariably other smaller (and normally technical) issues that the participants in the standards body may be aware of, but may not be allowed to divulge to the standards body. In this way, participants can often seem to hold views that are not necessarily logical. Meanwhile, there are also personal views, aspirations and objectives. Foremost amongst these, as has been noted above, is the NIH syndrome. NIH is typically a force to be reckoned with during all discussions on all standards. Individuals always want to see their ideas accepted. We are all children seeking attention and reward. Spending time constructing code further reinforces the commitment that an individual has to an idea. In the end there are only two alternatives for a standards body: either to accept the contribution(s) and provide an all-inclusive standard, or to provide a lowest common denominator standard. The former will suffer from too many features (e.g. SystemC), the latter will be too little to help (e.g. the initial OSCI TLM standard).
6.
STANDARDS PROCESS
While the development of a standard is fraught with pitfalls and personalities, there is a common process that is typically followed. The first hurdle is always, without fail, that of terminology and taxonomy. Indeed, the issue can continue throughout the process with different “terms” being used by different people at different times. Common terminology and taxonomy are hard to agree upon. This is made worse by the pseudo-political nature that terms can be subject to. For instance, a term may be used by an individual to “sell” an idea internally to their company. The company may then insist upon that “term” simply because they are sold on the idea. If it turns out that the term has been misused, or used differently by another person, then there is no easy resolution! Far-fetched as this may seem, it happens regularly! To make it even more complex, the same terms may be used to denote different objects in different domains and, vice versa, different terms may be used to denote the same object. Often in the standardisation process people bring their specific domain knowledge and company traditions, which are then confronted with the larger context of other domains, traditions and camps.
Having reached some sort of understanding (never agreement) on terminology, standards bodies can be seen in two different “stages” of a life cycle. In the first, the standard lags behind the development and use of objects. These objects should be “donated” to the standards body. However, the standards body is paralysed by legal constraints, the NIH syndrome and a lack of resources to “amalgamate” the various proposals. The second “life stage” of a standards body is a more productive “steady state”. In this phase, the majority of the work has been done. However, there are user requirements and changes being asked for. In this state, there is much less conflict, and much more productive work. However, the impetus behind the initial drive is lost, and major changes cannot be undertaken because of a lack of commitment and resource. Somewhere between these two states lies a productive standardisation body tackling in some way all these diverging issues. 7.
SUGGESTIONS FOR STANDARDS BODIES
Given the overwhelming requirement for standards, it is vital for the platform based design industry that an effective mechanism is adopted to produce these standards. Many different combinations have been tried. From an objective standpoint it seems increasingly clear that the issue of providing engineering services is critical. In principle this should not only unblock the work required to be done, but also help to minimise the NIH syndrome. This implies a funded standardisation organisation with significant dues that can deploy engineering effort to ensure the progress of the standard. Often the autocratic style of de facto standards organisations is seen as being beneficial, and this again has not been applied to the EDA industry. Hence, it is the authors’ belief that an autocratic, funded organisation is needed to build the de facto standard for the platform based design industry.
CHAPTER 2 PROGRAMMABLE PLATFORM CHARACTERIZATION FOR SYSTEM LEVEL PERFORMANCE ANALYSIS
DOUGLAS DENSMORE¹, ADAM DONLIN², AND ALBERTO SANGIOVANNI-VINCENTELLI¹
¹ University of California, Berkeley; ² Xilinx Research
1.
INTRODUCTION
It is customary to begin a discussion of Electronic System Level (ESL) design by stating at least one of the following observations [7]:
• Time-to-market is a major influence on the design of most electronic systems;
• Design complexity has outpaced designer productivity;
• Verification effort now dominates design effort;
• Custom design costs too much to be practical for most designs;
• Register Transfer Level (RTL) design methods will not scale to address the design complexity;
• Designers must work at higher levels of design abstraction to overcome design complexity;
• Design re-use will be necessary to overcome design complexity.
Figure 1 presents a graph relating design complexity to designer productivity with both RTL and ESL design methods. Today, most designers work with RTL design tools and languages. They find themselves in the ‘design gap’ where the system they are trying to create exceeds the capabilities of their design environment. This is not to say that the design gap cannot be crossed. On the contrary, the gap can be overcome with existing design methods but only at a significantly increased cost (both financially [5], [6] and in designer effort). Existing RTL design methods will continue to be employed until the additional cost of design overwhelms the commercial viability of the final design. This ‘maximum tolerable design gap’ varies per technology, per market segment, or even per product. A transition from RTL to ESL is required to completely overcome the design gap. A transition must occur since it is well accepted that design complexity will continue to increase. Today, the design community is approaching a point of inflexion between the two methods – the rate of ASIC design starts in recent years has remained relatively flat while implementation technologies continued to trend upwards [9].
Figure 1. The Methodology Gap [plot of design complexity (# transistors) against time, showing maximum designer productivity at RTL and at ESL, the design gap today, the maximum tolerable design gap, and the methodology gap]
M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 13–30. © 2006 Springer.
An important question, therefore, is “what limits the widespread adoption of ESL by the majority of designers?”. A simplistic answer is that ESL design methods, tools, and languages are simply not mature enough to convince designers to risk traversing the gap between the two methodologies. To further complicate the answer, we must respect that ESL methods tackle multiple design problems and design projects influence the relative importance of each problem. Therefore, a complex compromise must be struck between the ESL vendors who create a set of tools and the system designers who use ESL tools to create and analyze models of their system. For example, the most common design tasks addressed within an ESL methodology are:
• High speed functional simulation to allow early software development;
• Executable specification of new system components and overall system architecture;
• Performance analysis of a given system architecture;
• Design space exploration of alternative system architectures; and
• Functional verification of system components.
In [1], the relative importance of ESL design tasks was explored for a variety of product scenarios. Crossing the methodology gap requires a suitable ESL technology be in place for each design task. Furthermore, ESL tools must address the design
task in a way that matches the designer’s specific product scenario. An ESL solution for application-specific standard products (ASSPs) will differ from the solution for Structured ASIC. Figure 2 formulates, as an example, an ESL roadmap to support the transition of RTL designers to ESL. The exact number and sequence of steps varies according to the priorities of a given market segment. Each technology that is in place is a step up from RTL to ESL and a complete set is required for a smooth transition. If one or more of the steps are missing, the risk of migration will deter designers that are not close to their maximum tolerable design gap. Designers that reach their maximum tolerance before the ESL steps are available are in a pathological scenario because their product is no longer cost effective to develop. Naturally, the steps must also occur in a timely manner or system complexity will overtake productivity again. This chapter focuses on one specific step in the transition from RTL to ESL. ESL performance analysis allows designers to predict whether a given system architecture can meet a requested level of performance. Today, the two most common performance metrics are computational throughput and power consumption. In some markets, computational throughput is the dominant metric whilst, in others, power consumption takes pole position. As with all ESL design tasks, performance analysis relies on simulation or analysis of abstract system models to derive system performance in a given situation. Abstraction allows the system to be described early and at a reasonable cost but it also casts a shadow of doubt over the accuracy of performance analysis data. Since the data guide the selection of one system architecture over another, the veracity of data recovered from ESL performance analysis techniques must be weighed carefully by the designer. This is paramount for ESL acceptance and legitimacy.
Figure 2. Transcending RTL effort with ESL design technologies [plot of design complexity (# transistors) against time, showing steps from maximum designer productivity at RTL today up to maximum designer productivity at ESL: early software development, functional verification, performance analysis, design space exploration]
Fear of inaccuracy in ESL performance analysis is a major impediment to the transition from RTL to ESL. Beyond the fundamental abstraction-accuracy tradeoff, current ESL methods and tools lack a coherent set of performance modeling guidelines. These guidelines are important because they allow a single system model to be reused over multiple ESL design tasks: the same basic model must be instrumented for performance analysis without significant compromise to its usefulness in early software development. Clear guidelines and coding standards also allow analytical data to flow out from a model into multiple ESL vendor tools. Close coupling of an instrumentation interface to a single ESL vendor is generally perceived as a bad thing unless the ESL vendor’s tools precisely complement the designer’s target market segment. Typically, average and worst-case cost estimates for system features are used in ESL performance models. The inaccuracy of these measures quickly accumulates in the data recovered from ESL performance analysis tools and restricts the designer to measuring only the relative performance of two systems. In this chapter, we describe a novel technique that combines very accurate performance characterizations of a target platform with abstract models in an ESL design environment. We propose that characterization data recovered from an actual target can be gathered easily and can be annotated into ESL models to enhance the accuracy of performance estimates. In the prototype of our approach, we select a specific modeling environment that allows one set of target characterizations to be exchanged for another to aid in the selection of a specific instance of the target architecture.
1.1
Chapter Organization
The remainder of the chapter is organized as follows: first, we discuss our technique’s general process and its set of requirements and assumptions. The next section offers more details on the pre-characterization process by discussing the automatic generation of the group of target systems. An important part of this discussion is also how to automate the extraction of re-usable performance data from the target system’s physical design flow. Storage, organization, and categorization of this data are discussed next and we give an example of pre-characterizing processor systems on Xilinx Virtex Platform FPGAs. The chapter concludes by summarizing our results and discussing future work. 2.
PLATFORM CHARACTERIZATION
Platform characterization is the process of systematically measuring a set of properties of a platform or its components. Ultimately, this data will be annotated to ESL models of the platform’s components. Subsequent simulations of the annotated model yield performance estimates that are correlated to the actual target platform. As such, they are more accurate than equivalent models annotated with ‘ballpark’ performance metrics. In short, platform characterization extracts a relevant group of performance measures from the physical implementation of a platform.
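As a minimal sketch of this annotation step (not the mechanism of any particular ESL tool), the following Python fragment converts a cycle count produced by an abstract model into execution time using a characterized clock frequency. The class and function names are hypothetical. The two frequencies are those of the first two PowerPC systems reported in Table 3 of this chapter, and the cycle count is an invented illustration value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Characterization:
    """Measured properties of one implemented system (hypothetical record)."""
    name: str
    fmax_mhz: float  # maximum clock frequency reported by the physical tool flow


def execution_time_us(cycles: int, char: Characterization) -> float:
    """Annotate an abstract cycle count with a characterized clock frequency."""
    # 1 MHz is one cycle per microsecond, so cycles / MHz yields microseconds.
    return cycles / char.fmax_mhz


# Frequencies taken from the first two rows of Table 3 in this chapter.
tight = Characterization("1 PPC, 2 BRAM, 1 UART, tight addressing", 119.0)
loose = Characterization("1 PPC, 2 BRAM, 1 UART, loose addressing", 102.0)

CYCLES = 10_000  # illustrative cycle count from an abstract simulation

for system in (tight, loose):
    print(f"{system.name}: {execution_time_us(CYCLES, system):.2f} us")
```

Swapping one characterization record for another changes the estimate without touching the abstract model, which is the property the chapter relies on when it exchanges one set of target characterizations for another.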
Table 1. Performance Characterization Tradeoffs
Tradeoff      TLM          RTL    Platform Characterization
Effort        Medium       High   Medium-Low
Portability   Medium       Low    High
Accuracy      Medium-Low   High   Medium-High
For characterization to be applicable to a system design, an appropriate, measurable implementation of the target platform must already exist. Clearly, ASIC designs are less amenable to our approach because a suitable implementation of the target is not available early enough in the design process. Fortunately, programmable platforms based on FPGA technology and ASSP based systems are common targets for new design starts [9]. Both technologies are amenable to our approach. In the case of an ASSP, its physical properties and system architecture are fixed at fabrication. For FPGAs, the physical properties of the platform are fixed at fabrication, but the system architecture is flexible. Clearly we cannot apply our technique to the designer’s final platform: the system architecture for the designer’s product has not yet been determined. Instead, we pre-characterize a broad set of potential system architectures that the designer may use. Systems destined for this kind of target require the designer to choose the characterization data that is most representative of the system architecture they intend to create. Additionally, the designer can explore the effect of different system architectures through their characterizations. Systematically characterizing a target platform and integrating the data into the platform model is a three-way tradeoff between the following factors:
• Characterization effort;
• Portability of characterization data; and
• Characterization accuracy.
The more accurate a characterization, the more effort it will take to extract and the less portable it will be to other system models. Alternatively, a highly portable characterization may not be accurate enough to base a design decision on. Our process must offer more accuracy than standard transaction level approaches [1], require less effort than an RTL approach, and have more portability than an ASIC (static architecture) based target.
Table 1 relates our approach (platform characterization) to RTL and transaction level modeling (TLM) with regard to the three main tradeoffs.
2.1
Characterization Requirements
In order to develop a robust environment for platform characterization processes there are several requirements:
• Direct correlation between platform metrics characterized and architecture models – the designer must make sure that the models developed can be easily paired with characterization data and that the models have the essential behaviors which reflect the aspects captured by characterization.
• IP standardization and reuse – in order to make characterization scalable, designs must use similar components. If each design is totally unique and customized there will not be existing information regarding its characterization. For example, Xilinx FPGAs employ the IBM CoreConnect bus and the CoreConnect interfaces on the IPs in their Xilinx Embedded Development Kit (EDK). This requirement will be expanded in the next section when extraction is discussed.
• Tool flow for measuring target performance – whichever platform one chooses to model, the actual target device must have a method and tool flow to gather measures of the characterization metrics.
• System Level Design environment with support for quantity measurements – the framework that the characterization data is integrated with must support the measurement of quantities (execution time, power, etc). In addition it must allow the data used to calculate these quantities to come from an external source.
We will not focus on the details of each requirement. The discussion in this chapter will describe characterization in the context of the Metropolis design environment [4] and Xilinx Virtex FPGAs. Other design environments and platform types that meet the requirements above may also be characterized with our approach. The next section will discuss how to begin the process of gathering data for use in the characterization process.
3.
EXTRACTION OF SYSTEM LEVEL CHARACTERIZATION DATA
Extraction of data from the target platform’s physical tool flow is at the heart of our characterization methodology. Extraction is a multi-phase process concerned with:
• Selecting a programmable platform family – by selecting a family of products in a programmable product line, one increases the opportunity that the extraction will be portable to other systems. Ideally, the selection of a platform family is done without regard to application domain but, in practice, this will influence the designer’s decision. An example of a programmable platform family is the Xilinx Virtex-4 family of platform FPGAs.
• Selecting programmable platform components – the properties of the components will vary depending on the granularity and type of the programmable platform. For example an FPGA could consist of IP blocks, embedded processing elements, or custom made logic blocks.
• Selecting systems for pre-characterization – from the selected components, assemble a template system architecture. From this template architecture create many permutations. In many cases permutation of the template architecture is automatic. Permutations can be made incrementally using various heuristics regarding the desired number and types of components. For example, one might want to constrain the number and type of embedded processors instantiated or the number of bus masters/slaves. The entire permutation space does NOT need to be generated.
• Independently synthesize and implement each system permutation – the ability to quickly synthesize the created architecture instances is what differentiates programmable platforms from static, ASIC like architectures. Each of the systems should be pushed through the entire synthesis and physical implementation flow (place, route, etc).
• Extracting desired information from the synthesis process and its associated analysis – the conclusion of the synthesis process will give information about each system. Such information includes (but is not limited to) various clock cycle values, longest path signal analysis, critical path information, signal dependency information, and resource utilization. Standard report processing tools like Perl can be used to automatically extract the appropriate information from the platform tool reports.
Figure 3, below, illustrates our pre-characterization process.
Figure 3. Pre-characterizing Programmable Platforms [flow diagram: 1. Select target device family (FPGA programmable platforms such as Virtex-II, Virtex-II Pro and Virtex-4, or ASSP programmable platforms); 2. Select platform components for characterization in target systems; 3. Define a system architecture template; 4. Generate multiple permutations of the platform components according to the architecture template; 5. Characterize the physical properties of each system that was generated, using the physical tool flow and its tool reports; 6. Organize characterized systems into a reusable characterization database]
3.1
Data Extraction Concerns
The issues that need to be observed during extraction are:
• Modularity – after the initial selection of components and architecture template, the rest of the extraction can be performed by many independent extraction processes. These processes can be distributed over multiple workstations. Given enough workstations, this reduces the time to generate and characterize N permutations to a constant M, where M is the duration of the longest permutation.
• Flexibility – ultimately the extracted characterization data must be correlated to designs during simulation. Therefore the closer the permuted templates are to the actual designs the better. In most cases they will be identical but it is possible that some model designs will have parameters that differ from the characterized system. In the event that the differences do not affect the performance under test, the characterization data can be used.
• Scalability – the extraction process is independent of the storage mechanism for the data so it in no way limits the amount of characterization data that can be extracted. Constraints can be placed on the permutations of the initial template. Theoretically, all permutations of the target’s component library are candidates for characterization. Even though the characterizations can happen at the platform vendor well in advance of the designer using the data, the set of permutations will be constrained. This is necessary to maintain a reasonable total runtime for the overall extraction process.
4.
EXAMPLE PLATFORM CHARACTERIZATION
To exemplify our process, we pre-characterized a set of typical FPGA embedded system topologies [2]. Each topology was generated from a template to create a
microprocessor hardware specification (MHS) file for the Xilinx Embedded System tool flow. We generated architectures with permutations of the IPs listed in Table 2. The table also shows the range in the number of IP instances that can be present in each system along with the potential quantities of each. In addition to varying the number of these devices, we also permuted design strategies and IP parameters. For example, we influenced the system’s address decoding strategy by specifying tight (T) and loose (L) ranges in the peripheral memory map. A loose range in the memory map means the base and high addresses assigned to the peripheral’s address decoder are wider than the actual number of registers in the peripheral. For a tight range, the converse is true. We also permuted the arbitration policy (registered or combinatorial) for systems that contained an On-Chip Peripheral Bus (OPB). These axes of exploration were used to investigate the relationship between peripherals and the overall system timing behavior. The columns of Table 2 show three permutation “classes” that were used. The implementation target was always a Xilinx XC2VP30 device. The first class (column Blaze), refers to designs where Blaze and OPB were the main processor and bus IPs respectively. The second class (column PowerPC) represents PowerPC and Processor Local Bus (PLB) systems. The third class (Combo) contain both Blaze and PowerPC. The number of systems generated is significant (but not unnecessarily exhaustive) and demonstrates the potential of this method. Note each system permutation can be characterized independently and hence, each job can be farmed out to a network of workstations. For reference, the total runtime to characterize the largest Combo system with Xilinx Platform Studio 6.2i on a 3GHz Xeon with 1 GB of memory was 15 minutes. 
The physical design tools were run with the “high effort” option and a User Constraint File (UCF) that attempts to maximize the system clock frequency. An observation of the characterization data shows that as resource usage increases (measured by slice1 count) the overall system clock frequency decreases.
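The report-processing step described in Section 3 can be sketched in Python (the chapter itself suggests Perl). The report text and its field labels below are invented for illustration and are not claimed to reproduce actual Xilinx tool output; only the general idea of pattern-matching tool reports for resource usage and clock frequency comes from the chapter.

```python
import re

# Hypothetical report excerpt; real tool reports will use a different layout.
SAMPLE_REPORT = """\
Design: synthetic_sys_0
Number of occupied slices: 1611
Maximum frequency: 119.0 MHz
"""

SLICES_RE = re.compile(r"Number of occupied slices:\s*(\d+)")
FMAX_RE = re.compile(r"Maximum frequency:\s*([\d.]+)\s*MHz")


def extract_characterization(report: str) -> dict:
    """Pull the slice count and maximum frequency out of one report."""
    slices = SLICES_RE.search(report)
    fmax = FMAX_RE.search(report)
    if slices is None or fmax is None:
        raise ValueError("report does not contain the expected fields")
    return {"slices": int(slices.group(1)), "fmax_mhz": float(fmax.group(1))}


record = extract_characterization(SAMPLE_REPORT)
print(record)
```

Raising an error on a missing field, rather than returning a partial record, is a design choice of this sketch: a silently incomplete entry in a characterization database would be hard to detect later.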
Table 2. Example System Permutations
Component                                  Blaze   PowerPC   Combo
PowerPC (P)                                –       1–2       1–2
Blaze (M)                                  1–4     –         1–4
BRAM (B)                                   1–4     1–4       1–2 (per bus)
UART (U)                                   1–2     1–2       1–2 (per bus)
Loose vs. tight Addressing                 Yes     Yes       Yes
Registered or Combinatorial Arbitration    Yes     n/a       Yes
Total Systems                              128     32        256
1 A slice contains two 4-input function generators, carry logic, arithmetic logic gates, muxes, and two storage elements.
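The Blaze and PowerPC totals in Table 2 are consistent with a full cross product of the listed ranges, which can be checked with a short sketch. The dictionary keys and the function name are hypothetical. The Combo class is left out because the text does not spell out how its "per bus" ranges combine.

```python
from itertools import product

# Parameter ranges as listed in Table 2.
BLAZE_CLASS = {
    "blaze": range(1, 5),  # 1-4 Blaze processors
    "bram": range(1, 5),   # 1-4 BRAMs
    "uart": range(1, 3),   # 1-2 UARTs
    "addressing": ("tight", "loose"),
    "arbitration": ("registered", "combinatorial"),
}
POWERPC_CLASS = {
    "powerpc": range(1, 3),  # 1-2 PowerPC processors
    "bram": range(1, 5),     # 1-4 BRAMs
    "uart": range(1, 3),     # 1-2 UARTs
    "addressing": ("tight", "loose"),
}


def enumerate_systems(ranges: dict) -> list:
    """Enumerate every combination of the given parameter ranges."""
    keys = list(ranges)
    return [dict(zip(keys, values)) for values in product(*ranges.values())]


blaze_systems = enumerate_systems(BLAZE_CLASS)
powerpc_systems = enumerate_systems(POWERPC_CLASS)
print(len(blaze_systems), len(powerpc_systems))
```

Each dictionary in the resulting lists describes one candidate system, and could be used to fill in a hardware specification template for the tool flow.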
Figure 4. Combo systems resource usage and performance
Figure 4 shows a graph of sample Combo systems, their size, and reported performance. We used nested loops over each IP to generate the system permutations, so the generated systems show predictable increases in area and complexity. The major, periodic drop in area is as anticipated: it indicates that a Blaze processor was added to the system topology and all other peripheral IPs were reset to their lowest number. Note that the graph’s performance trace is neither linear nor monotonic. Often area is constant while frequency changes drastically. This phenomenon prevents area-based frequency estimation. The relationship between the system’s area utilization and performance is complex, showing that building a static model is difficult, if possible at all, and confirming the hypothesis that actual characterization can provide more accurate results.

Table 3 highlights an interesting portion of the data collected in the PowerPC class. Each row is a PPC system instance: the leftmost columns show the specific IP configuration for the system and the remaining columns show area usage, maximum frequency, and the % change (Δ) from the previous system configuration (representing potentially a small change to the system). We contend that a difference of 10% is noteworthy and 15% is equivalent to a device speed-grade. Note that there are large frequency swings (14%+) even when there are small (< 1%) changes in area. This is not intuitive, but seems to correspond to changes in addressing policy (T vs. L) and indicates that data gathered in pre-characterization is easy to obtain, not intuitive, and more accurate than analytical cost models.

Table 3. Non-linear Performance PPC Systems: P = Number of PPCs, B = Number of BRAMs, U = Number of UARTs
P | B | U | Addr. | Area | f (MHz) | ΔMHz | ΔArea
1 | 2 | 1 | T | 1611 | 119 | 16.17% | 39.7%
1 | 2 | 1 | L | 1613 | 102 | –14.07% | 0.12%
1 | 3 | 0 | T | 1334 | 117 | 14.56% | –17.29%
1 | 3 | 0 | L | 1337 | 95 | –18.57% | 0.22%
1 | 3 | 1 | T | 1787 | 120 | 26.04% | 33.65%

Figure 5. PowerPC System Performance Analysis

Figure 5 illustrates Table 3 and shows area and separate performance traces for PPC systems in two addressing range styles. The graph demonstrates that whilst area is essentially equivalent, there are clear points in each performance trace with deviations greater than 10%.
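The comparison described for Table 3 can be sketched in a few lines of Python. This is only an illustration, not the tooling used to produce the table: the function names and the two thresholds are assumptions, and the area and frequency values are copied from Table 3. Because the table prints rounded frequencies, the frequency deltas recomputed here differ slightly from the printed ΔMHz column, and the first row has no predecessor in the excerpt, so no delta is computed for it.

```python
# Rows copied from Table 3: (P, B, U, addressing, area, f in MHz).
TABLE3 = [
    (1, 2, 1, "T", 1611, 119),
    (1, 2, 1, "L", 1613, 102),
    (1, 3, 0, "T", 1334, 117),
    (1, 3, 0, "L", 1337, 95),
    (1, 3, 1, "T", 1787, 120),
]


def pct_change(prev, curr):
    """Percentage change from prev to curr."""
    return 100.0 * (curr - prev) / prev


def flag_swings(rows, freq_threshold=10.0, area_threshold=1.0):
    """Return (row index, area delta %, frequency delta %) for every row whose
    frequency moves by at least freq_threshold percent while its area moves
    by less than area_threshold percent, relative to the previous row."""
    flagged = []
    for i in range(1, len(rows)):
        d_area = pct_change(rows[i - 1][4], rows[i][4])
        d_freq = pct_change(rows[i - 1][5], rows[i][5])
        if abs(d_freq) >= freq_threshold and abs(d_area) < area_threshold:
            flagged.append((i, round(d_area, 2), round(d_freq, 2)))
    return flagged


swings = flag_swings(TABLE3)
for index, d_area, d_freq in swings:
    print(f"row {index}: area {d_area:+.2f}%, frequency {d_freq:+.2f}%")
```

Both flagged rows are the ones where only the addressing style changes from T to L, which is the pattern the text points out.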
5. ORGANIZATION OF SYSTEM LEVEL CHARACTERIZATION DATA
The organization of the raw, extracted data is the process of categorizing the information in such a way that system simulation remains efficient, data remains portable, and flexible data interactions can be explored. This is a very important part of the characterization process and if a poor job is done in this stage, many of the benefits of the previous efforts will be lost. More concisely the goals are thus:
• Maintain system efficiency – if the simulation performance of the system using estimated data (a naïve method) is P_E and the performance of the system using characterized data (our presented method) is P_C, the relation P_C ≥ P_E must hold. Performance in this case is a measure of simulation effort or cycles consumed, which directly affects the execution time of the simulation or its run-time memory requirement (higher performance results in lower execution time or run-time memory requirement).
• Portable Data – in order to reuse data, it must be stored in such a way that it is maximally portable amongst various models. This requires three things: 1) a standard interface for accessing the stored data, 2) a common data format for the stored data, and 3) the ability for the data set to grow over time.
• Flexible Data Interaction – data interaction refers to allowing many ways in which data can interact in order to give information regarding the performance of the simulation. For example, if data regarding transactions/instruction can be combined with information regarding cycles/transaction, one can determine the cycles/instruction. Another example is that if Transaction1 can use signals S1 or S2 and it is known that S1 resides along a longer path than S2, Transaction1 can utilize S2 for greater performance. It is best to place no restriction on data interaction insofar as it does not conflict with any of the other characterization goals.

5.1 Data Categorization
With the goals defined for characterization data organization, the second aspect that must be determined is how data is categorized. Data can be categorized in many ways depending on what is being modeled. This discussion is limited to what is typically required for programmable architecture models of embedded systems. To this end there are three categories:
• Physical Timing – this information details the physical time for signals to propagate through a system. Typically this information is gathered via techniques such as static timing analysis or other combinational or sequential logic design techniques to determine clock cycle or other signal delays.
• Transaction Timing – this information is a unit of measure which details the latency of a transaction. A transaction is an interaction between computational units in a point-to-point manner or through other communication mechanisms (buses, switches, etc). This could be a cycle count in reference to a particular global clock which determines the overall speed of the system. Alternatively, it could be an asynchronous measure.
• Computation Timing – this information concerns the computation time taken by a specific computation unit. This could be HW and/or SW based routines. For example, it could be a cycle count given by the time a HW unit (Adder, Shifter, etc) takes to complete its operation. Alternatively, it could be the cycle time taken by a particular software routine (Discrete Cosine Transform perhaps) running on a particular embedded processor.
These three areas interact to give estimated performance for a model under simulation. The following example (Table 4) shows how all three areas can be used, along with their ability to interact flexibly, to provide performance analysis.

5.2 Data Storage Structure
Finally, it must be decided what actual structure will hold the characterized data. The primary concerns relate to the goals mentioned initially regarding portability and efficiency. This should be a structure that can grow to accommodate more entries. Ultimately, the structure used is determined by which system level design environments are intended to be used. However, the following issues should be considered:
• What is the overhead associated with accessing (reading) the data?
• What is the overhead associated with storing (writing) the data?
• Can data be reorganized incrementally?
• Can data be quickly sorted? Searched?
More specifics on data structures for characterization data will be touched on in the next section when specific example executions are discussed. For now we leave the reader with an illustration of an abstract structure in Figure 6. The left hand side of the illustration shows the data categorization and where that data is generated. The right hand side shows a sample entry in the data storage structure where each system categorized has its own index and may have independent or shared entries in the storage structure.

Table 4. Sample Simulation
Instruction | Timing Categorization | Performance Implication
Read(0x64, 10B) | Transaction – 1 cycle/Byte | 10 cycles
execute(FFT) | Computation – FFT 5 cycles | 5 cycles
Write(0x78, 20B) | Transaction – 2 cycles/Byte | 40 cycles
Total Cycles | Physical – 1 cycle/10 ns | 550 ns
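As a worked illustration of how the three timing categories combine, the arithmetic of Table 4 can be reproduced with a small Python sketch. The dictionary-based store and the function name are assumptions made for this example only; the per-byte, per-routine and per-cycle figures are the ones given in Table 4.

```python
# Physical timing: one clock cycle lasts 10 ns (Table 4).
CYCLE_NS = 10

# Transaction timing: cycles per byte transferred (Table 4).
TRANSACTION_CYCLES_PER_BYTE = {"Read": 1, "Write": 2}

# Computation timing: cycles per routine (Table 4).
COMPUTATION_CYCLES = {"FFT": 5}


def instruction_cycles(kind, arg):
    """Look up the cycle cost of one instruction.

    For transactions, arg is the number of bytes moved; for execute,
    arg is the name of the routine.
    """
    if kind in TRANSACTION_CYCLES_PER_BYTE:
        return TRANSACTION_CYCLES_PER_BYTE[kind] * arg
    if kind == "execute":
        return COMPUTATION_CYCLES[arg]
    raise ValueError(f"no characterization data for {kind!r}")


# The three instructions of Table 4 (addresses omitted, they do not affect cost here).
program = [("Read", 10), ("execute", "FFT"), ("Write", 20)]

per_instruction = [instruction_cycles(kind, arg) for kind, arg in program]
total_cycles = sum(per_instruction)
total_ns = total_cycles * CYCLE_NS

print(per_instruction, total_cycles, total_ns)
```

The transaction and computation entries give 10 + 5 + 40 = 55 cycles, and the physical timing entry converts that to 550 ns, matching the last row of Table 4.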
Figure 6. Characterization Data Organization Illustration
6. INTEGRATION OF SYSTEM LEVEL DESIGN ENVIRONMENTS AND CHARACTERIZATION
Once the data has been extracted and organized, it must be integrated into a system level design environment for simulation. The following discussion highlights the key issues associated with this integration and provides an example of each in the Metropolis environment.
• Separation of architecture models and their scheduling – this requirement allows the data structure containing the extracted data to be added and modified independently of the actual system.
  – In Metropolis, architecture models are a combination of two netlists. The first netlist is called the scheduled netlist and contains the topology and components that make up the architecture instance (CPUs, BUS, etc). The other netlist is the scheduling netlist and contains schedulers for each of the components in the scheduled netlist. When multiple requests for a resource are made in the scheduled netlist, it is the scheduling netlist which resolves the conflict (according to any number of algorithms). The schedulers themselves are called quantity managers, since the effect of scheduling is access to update a quantity (time, power, etc) of the simulation. See [8] for more information on Metropolis architecture modeling.
• Ability to differentiate concurrent and sequential requests for resources – the simulation must be able to determine whether requests for architecture resources occur simultaneously and are allowed to be concurrent, or whether they should be sequential and, if so, in what order. This is important since each request will be accessing characterization data and accumulating simulation performance information which may be order dependent.
  – In Metropolis there is a resolve() phase during simulation. This is the portion of simulation where scheduling occurs, selecting from multiple requests for shared resources. This is done by quantity managers in Metropolis.
• Simulation step to annotate data – during simulation there should be a distinct and discernible point at which simulation data is annotated with characterized data.
  – Metropolis is an event based framework which generates events in response to functional stimulus. These events are easily isolated and augmented with information from characterization during scheduling (with the request() interface). Events are set up like the tagged-signal model of [3].
The overall message is that once the data is ready to be integrated into the design environment: 1) it must be possible to add it non-destructively; 2) it must be able to augment existing simulation information detailing performance; and 3) the simulation must be able to correctly recognize concurrent and sequential interactions/requests for the characterized data.
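The three requirements can be illustrated with a minimal Python sketch. It is not Metropolis code (Metropolis models are written in its own meta-model language); the classes Event and QuantityManager, the FIFO scheduling policy and the characterization values are assumptions chosen for illustration. Only the method names request, resolve and annotate are taken from the text.

```python
from collections import deque


class Event:
    """A request for a service, generated by an architecture thread."""

    def __init__(self, task, service, kind, size=1):
        self.task = task        # name of the (blocked) requesting task
        self.service = service  # e.g. "BUS"
        self.kind = kind        # e.g. "read" or "write"
        self.size = size        # e.g. bytes transferred
        self.cost = None        # filled in by annotate()


class QuantityManager:
    """Scheduler for one service, kept separate from the component it schedules.

    The characterization data is passed in as a plain dict, so it can be
    swapped or extended without touching the architecture model.
    """

    def __init__(self, characterization):
        self.characterization = characterization
        self.pending = deque()
        self.time = 0  # the managed quantity, in cycles

    def request(self, event):
        """Queue an event; the requesting task stays blocked while it is pending."""
        self.pending.append(event)

    def resolve(self):
        """Select one pending event (FIFO here; any algorithm could be used)."""
        return self.pending.popleft() if self.pending else None

    def annotate(self, event):
        """Index the characterization data by event information and update time."""
        per_unit = self.characterization[(event.service, event.kind)]
        event.cost = per_unit * event.size
        self.time += event.cost
        return event


# Hypothetical characterization data: cycles per byte on the bus.
bus = QuantityManager({("BUS", "read"): 1, ("BUS", "write"): 2})
bus.request(Event("task_a", "BUS", "read", size=10))
bus.request(Event("task_b", "BUS", "write", size=20))

granted = []
event = bus.resolve()
while event is not None:
    bus.annotate(event)
    granted.append((event.task, event.cost))  # the task would be unblocked here
    event = bus.resolve()

print(granted, bus.time)
```

Because the two requests target the same resource, they are serialized by resolve(), and the accumulated time depends on the characterization entries rather than on anything hard-coded in the requesting tasks.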
6.1 Sample Annotation Semantics
This section demonstrates an example execution of a system integrated with the characterized data. It is based on the Metropolis design environment. In this case, the structure holding the data is a hash table like structure indexed by information regarding the topology of the system.
1. An active Metropolis architecture thread generates an event, e. This event represents a request for a service. It will have been caused by a functional model mapped to this architecture needing a service (CPU, BUS, etc). This event can represent a transaction or computation request.
2. The service makes a request to its scheduler with the request(e) method. This passes the request from the scheduled netlist to the scheduling netlist, where e joins a list of pending events. While this event is awaiting scheduling, the task that generated it remains blocked.
3. Once all events that can be generated for this simulation step have been created, the simulation proceeds to a resolve() phase where scheduling decisions (algorithms vary depending on the service they schedule) are made which remove selected events from the pending lists.
4. The selected events are annotated, with annotate(e), by indexing the characterized database according to event information. This allows access to simulation quantities (like simulation global time) which can now be influenced by annotated events. Note that this has no greater impact on simulation performance than using estimated data (a requirement of our methodology).
5. Report back to the task that it can now continue (unblock the thread).
6. The process can occur recursively when transactions like read() use CPU, BUS, and MEM services.
Figure 7 illustrates these steps in Metropolis.

Figure 7. Sample Execution Semantics

7. CHARACTERIZATION PERFORMANCE ANALYSIS EXAMPLE: MJPEG
To demonstrate how a programmable platform performance characterization method can be used to make correct decisions during design space exploration, the following multi-media example is provided. This example deals with evaluating various architecture topologies and illustrates the importance of accuracy in characterization and exemplifies the fidelity achieved with our method. In an exploration like this one, the designer is interested in choosing the design with the best performance. Therefore it is not as important that the exact performance be known, but rather that the ordering of the performances amongst the candidates is correct (hence the emphasis on fidelity). Without the method covered in this chapter, estimated values would be used to inform the designer of the predicted performance. These values may come from datasheets, previous simulations, or even best guesses. None of these are preferable to actual characterization as will be shown. The application chosen was Motion JPEG (MJPEG) encoding and both the functional and architectural models were created in the Metropolis design environment. Investigated are four Motion-JPEG architectural models. The topologies are shown in Figure 8. Each of them represents a different level of concurrency and task grouping. A single functional model was created in Metropolis which isolated various levels of task concurrency between the DCT, Quantization, and Huffman processes present in the application. These aspects of the functional model were then mapped to the architectural model. The diagrams show the architecture topologies after the mapping process. This was a one-to-one mapping where each computational unit was assigned a particular aspect of MJPEG functionality. The computation elements were MicroBlaze soft-processor models and the communication links were Fast Simplex Link (FSL) queues. 
In addition to the Metropolis simulation, actual Xilinx Virtex II Pro systems running on the Xilinx ML310 development platforms were developed. The goal was to compare how closely the simulations reflected the actual implementations and to demonstrate that the simulations were only truly useful when using our characterization approach. The results of a 32x32 image MJPEG encoding simulation are shown in Figure 8 as well. The table contains the results of Metropolis simulation and the results of the actual implementation. The first column denotes which architectural model was examined. The second column shows the results of simulation in which estimations based on area and assembly code execution were used. The third column shows the simulation results using the characterization method described in this chapter. Notice that the estimated results have an average difference of 35.5% with a maximum of 52%, while the characterized results have an average difference of 8.3%. This is a significant indication of the importance of our method. In addition, the fifth column shows the rank ordering for the real, characterized, and estimated cycle results respectively. Notice that the estimated ranking does not match the real ordering! Even though the accuracy discrepancy is significant, it is equally (if not more) significant that the overall fidelity of the estimated systems is different. Finally, the maximum frequency according to the synthesis reports, the execution time (cycles * period), and the area values of the implementation are shown. This confirms that while one might be tempted to evaluate only the cycle counts, it is important to understand the physical constraints of the system, which are only available with characterized information.
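The distinction between accuracy (how far a prediction is from the real value) and fidelity (whether the ordering of candidates is preserved) can be made concrete with a short Python sketch. The cycle counts below are hypothetical placeholders chosen only to show the effect; they are not the values reported in Figure 8, and the function names are likewise assumptions.

```python
def rank_order(cycles_by_design):
    """Design names sorted from fewest to most cycles (best first)."""
    return sorted(cycles_by_design, key=cycles_by_design.get)


def mean_abs_error_pct(predicted, real):
    """Average absolute difference from the real cycle counts, in percent."""
    errors = [abs(predicted[d] - real[d]) / real[d] for d in real]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical placeholder cycle counts, NOT the data of Figure 8.
real = {"A": 1000, "B": 800, "C": 900, "D": 700}
characterized = {"A": 1080, "B": 850, "C": 980, "D": 760}
estimated = {"A": 700, "B": 760, "C": 500, "D": 650}

characterized_fidelity = rank_order(characterized) == rank_order(real)
estimated_fidelity = rank_order(estimated) == rank_order(real)

print("real ranking:         ", rank_order(real))
print("characterized ranking:", rank_order(characterized), characterized_fidelity)
print("estimated ranking:    ", rank_order(estimated), estimated_fidelity)
print("characterized error %:", round(mean_abs_error_pct(characterized, real), 1))
print("estimated error %:    ", round(mean_abs_error_pct(estimated, real), 1))
```

In a design space exploration the ranking check matters most: a designer picking the first entry of the estimated ranking in this made-up data would choose a different design than the real measurements support.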
Figure 8. MJPEG Performance Analysis Example
8. CONCLUSIONS
Before the gap between designer productivity and design complexity becomes an impassable chasm, architects must complete a transition from RTL to ESL design methods. However, a complete path from RTL to ESL has not yet been established. The reasons for the ESL methodology gap include the difficulty of isolating a set of design technologies that solve ESL design problems for the diverse range of system types. Designers desirous of ESL performance analysis tools are also wary of the accuracy of the data they can recover from existing tools and models. In this chapter, we presented an ESL performance analysis technology for programmable platforms. Our approach united characterizations of actual platforms with abstract designer model simulations. The result is an integrated approach to ESL performance modeling that increases the accuracy of performance estimates. Our use of Metropolis quantity managers also eases design space exploration by separating the architectural models of a system from the specific timing model used during system simulation. Our future efforts with system level pre-characterization will begin with a deeper exploration of the tradeoff between accuracy and a given system model’s level of abstraction. Additionally, we will apply formal techniques to analyze the bounds of our approach, which is currently simulation based.

REFERENCES
[1] Adam Donlin, Transaction Level Modeling: Flows and Use Models, In International Conferences on Hardware/Software Codesign and System Synthesis (CODES+ISSS ’04), 2004.
[2] Douglas Densmore, Adam Donlin, Alberto Sangiovanni-Vincentelli, FPGA Architecture Characterization for System Level Performance Analysis, Design Automation and Test Europe (DATE), March 2006.
[3] Edward A. Lee and Alberto Sangiovanni-Vincentelli, A Framework for Comparing Models of Computation, IEEE Transactions on CAD, Vol. 17, No. 12, June 1998.
[4] Felice Balarin, Harry Hsieh, Luciano Lavagno, Claudio Passerone, Alberto Sangiovanni-Vincentelli, and Yoshi Watanabe, Metropolis: An Integrated Environment for Electronic System Design, IEEE Computer, April 2003.
[5] G. Smith, D. Nadamuni, L. Balch, and N. Wu, Report on Worldwide EDA Market Trends, Gartner Dataquest, December 2005.
[6] J. Vleeschhouwer and W. Ho, The State of EDA: Just Slightly Up for the Year to Date, Technical and Design Software, The State of the Industry, December 2005.
[7] Kurt Keutzer, Sharad Malik, A.R. Newton, Jan Rabaey, and Alberto Sangiovanni-Vincentelli, System Level Design: Orthogonalization of Concerns and Platform Based Design, IEEE Transactions on Computer Aided Design, December 2000.
[8] Abhijit Davare, Douglas Densmore, Vishal Shah, Haibo Zeng, A Simple Case Study in Metropolis, Technical Memorandum UCB/ERL M04/37, University of California, Berkeley, 94720, September 2004.
[9] Gartner, “Market Trends: ASIC and FPGA, Worldwide, 2002-2008, 1Q05 Update”.
CHAPTER 3 USE OF SYSTEMC MODELLING IN CREATION AND USE OF AN SOC PLATFORM: EXPERIENCES AND LESSONS LEARNT FROM OMAP-2
JAMES ALDIS Texas Instruments France Abstract:
OMAP-2 is a platform for creation of systems-on-chip (SOCs). It is underpinned by a basic set of rules and guidelines covering programming models, bus interfaces and RTL design. It comprises the basic building blocks of an SOC: memory controllers, networks-on-chip (NOCs), DMA controllers, interrupt controllers, processor subsystems, UARTs, timers and so on. SOCs can be constructed from the OMAP-2 platform by selecting the basic building blocks and adding device-specific IPs such as multimedia or modem coprocessors. The platform is pre-validated and timing-closed for a range of process nodes, so top-level SOC integration is straightforward.
This paper discusses the role of performance modelling in the creation of both the OMAP-2 platform and the devices based on it. The platform itself is required to be highly generic: capable of supporting a wide range of functional and performance requirements, some of which may be unknown when the platform is created. Surprisingly, the same can be true for the specific devices, which are frequently openly programmable and expected to have a life extending beyond that of the products that drive their development. It is, however, generally true that device requirements are more precise than platform requirements.

Because the SOCs built from OMAP-2 are highly complex, it is not possible to analyse performance satisfactorily using static calculations such as in spreadsheets. Therefore simulation is used. The requirements on the simulation technology are first and foremost ease of test case creation, ease of model creation, and credibility of results. The emphasis on test case creation is a consequence of the complexity of the devices and of the way in which an SOC platform such as OMAP-2 is used: because the whole motivation is to be able to move from marketing requirements to RTL freeze and tape-out in a very short time, and because in many cases large parts of the software will be written by the end customer and not by TI, the performance-area-power tradeoff of a proposed new SOC must be achieved without the aid of “the software”. Secondary requirements are: simulation speed, visibility of results and behaviour, modularity and re-usability, and the ability to integrate legacy and 3rd-party models.

A modelling technology has been created based on:
• SystemC.
• Standard cycle-based modelling technology for bus interfaces taken from OCP-IP.
• Privately-developed technology for test-case specification, module configuration, run-time control and results extraction.
Cycle-based interfaces are used throughout, because cycle-accuracy is required in some areas and use of a single interface technology throughout the platform is essential. Cycle-accurate interfaces do not necessarily imply cycle-accurate functionality, and in general the OMAP-2 simulations can be described as timing-approximate. The aim is to move to public domain technology in all areas as soon as appropriate solutions become available. This modelling technology is never used for software development. Virtual SOC platforms for software development are created independently.

The OMAP-2 platform is hugely successful and the modelling activity is an established part of it, playing a central role in all new SOC development, and in maintenance and enhancement of the platform itself. Furthermore, the modelling activity is increasingly being used in the support of customer product development based on OMAP-2 SOCs. The challenges for the future lie in making this technology usable outside the core OMAP-2 architecture team and in being able to import models from 3rd-party suppliers. Achievement of these goals is currently hampered by the lack of public standards for test case specification and for module configuration and control.

M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 31–47. © 2006 Springer.
1. OMAP-2 OVERVIEW
The OMAP-2 platform provides the basic building blocks to create a general-purpose computer system-on-a-chip [4]. It is designed for application and modem processors for mobile telephones. Typically multiple programmable processors of different types will coexist in an OMAP-2-based SOC, such as the OMAP2430 depicted below. In this case the main processor is a low-power RISC CPU, which is supported by other processors for video, audio and graphics, and by a large collection of memory controllers and peripheral interfaces.

Figure 1. OMAP2430 functional view

The principal shared characteristics of the modules to be found in the OMAP-2 platform are:
• Bus interfaces. All OMAP-2 modules use the same protocol, namely OCP [1].
• Power management functionality and interfaces.
• Interrupt/DMA-request interfaces.
• Synthesis scripts assuring timing closure at common frequencies at common process nodes.
• Programming models derived from a common base and common principles.
• Security- and debug-related functionality.
One vital element of the OMAP-2 platform which is not shown in the functional view of OMAP2430 is the interconnect or network-on-chip technology. It is absent because it is largely transparent to the user, whether software developer or hardware integrator. The network-on-chip (NOC) enables the processors and DMAs to access the memories and peripherals, using a common SOC-wide memory map. The NOC provides:
• Address-based routing of bus requests.
• Arbitration for concurrent access to shared memories.
• Adaptation of OCP protocol features between incompatible initiators and targets, for example bus width conversion, conversion of burst types to those supported at the target, or conversion from single-request-multiple-data to multi-request-multiple-data bursts.
• Clock frequency conversion between modules running at different rates.
• Programmable address-based connectivity control for enhanced security.
• Detection, routing and logging of error events.
There are typically multiple levels of NOC in an OMAP device and there are different NOC technologies available within OMAP-2, optimised for the different levels. Superficially these provide the same functionality, but they are very different in terms of performance and area. The connectivity between the principal processors and the principal memories is critical to system performance and is allowed to consume more silicon area and power than the paths to rarely-used peripherals.

The OCP protocol has been alluded to above. For full details see [1], but a basic overview of the important features of OCP is given here:
• OCP is a point-to-point protocol and not a shared-bus definition.
• OCP is core-orientated. It allows the creator of an IP core to specify the bus interface of the core independently of any NOC or bus technology it may be attached to later.
• OCP is highly configurable. The creator of an IP core may specify which features of the OCP protocol are supported by the core and which are not. It is not necessary to add functionality to the core to support features of the protocol which are irrelevant to the core.
• OCP may be an extremely high-performance protocol, using features such as single-request-multiple-data, manifold burst types, pipelined transactions, out-of-order responses and non-blocking multi-threaded activity.
• OCP may be a very lightweight and low-cost protocol, if limited to features such as basic non-burst reads and writes.
The OCP protocol and the modular multi-level NOC are cornerstones of the OMAP-2 platform. They permit the rapid creation of SOC products. The architects of the new product are able to select the processors and peripherals they desire and be confident that these are compatible with each other and that they can be connected as required. This is in some ways an intrinsically bottom-up process, with apparently little scope for optimisation except through selection of the modules from the library, if the development timescales are to be held. However, there is one module in every SOC that is created specifically for that SOC: the NOC. By varying the topology, the level of concurrency and the level of pipelining in the NOC, it is possible to create SOCs from the same basic modules with quite different capabilities.

This approach to SOC creation puts product performance analysis in the critical path. The product architects are able to fashion an SOC rapidly from existing material and to know immediately how big it will be, how fast it will run and (to a first approximation) how much power it will consume; but they must also know whether it meets its performance requirements. For this, architecture-level simulation is used, based on TLM concepts and the SystemC language [3]. This simulation capability is also a part of the OMAP-2 platform. The basic requirement on it is to be able to provide feedback on questions of product performance in the timescales for definition of an OMAP-2-based SOC, in the project phase before development resources are allocated and development starts. During this definition phase the SOC architecture is not stable and the performance analysis technology must live with this fact.
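Two of the NOC functions listed in this section, address-based routing and arbitration for concurrent access, can be illustrated with a small Python sketch. Everything specific in it is an assumption made for the example: the memory map is invented and is not the OMAP-2 map, fixed-priority arbitration is just one possible policy, and the function names are hypothetical.

```python
# Hypothetical memory map for illustration only: (base address, size, target).
MEMORY_MAP = [
    (0x00000000, 0x10000000, "external_memory"),
    (0x40000000, 0x00010000, "on_chip_sram"),
    (0x48000000, 0x00001000, "uart"),
]

error_log = []  # addresses of requests that matched no target


def route(address):
    """Return the target whose address range contains address, or None."""
    for base, size, target in MEMORY_MAP:
        if base <= address < base + size:
            return target
    error_log.append(address)  # detection and logging of an error event
    return None


def arbitrate(requests, priority):
    """Fixed-priority arbitration (an assumed policy): among initiators
    requesting the same target in one cycle, grant the one that appears
    earliest in the priority list."""
    winners = {}
    for initiator, address in requests:
        target = route(address)
        if target is None:
            continue
        current = winners.get(target)
        if current is None or priority.index(initiator) < priority.index(current):
            winners[target] = initiator
    return winners


grants = arbitrate(
    [("dma", 0x00001000), ("cpu", 0x00002000), ("dsp", 0x48000004), ("gfx", 0x50000000)],
    priority=["cpu", "dsp", "dma", "gfx"],
)
print(grants, [hex(a) for a in error_log])
```

Here the CPU and DMA contend for external memory and the CPU wins, the DSP reaches the UART unopposed, and the request to an unmapped address is logged as an error. Changing the map or the policy changes the behaviour without touching the initiators, which mirrors the point that the NOC is the one module tailored to each SOC.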
2. OMAP-2 PERFORMANCE MODELLING TECHNOLOGY
The OMAP-2 performance modelling technology that has been developed is used for the following purposes:
• Support of OMAP-2 platform development and maintenance.
• Support of SOC product definition, as described above: validation of the SOC’s performance before RTL development starts.
• Validation of details of SOC implementation, in particular the NOC configurations, during development.
• Provision of reference performance data to RTL and silicon validation teams.
• Response to queries from marketing and customers when new applications of an existing SOC design are proposed.
• Support of customers wishing to optimise the implementation of their application on the SOC (which DMA is it better to use? What size of burst should be used? What are the best arbitration options? etc.).
This section describes in more detail the SystemC technologies used in OMAP-2 performance modelling, in particular the cycle-based TLM standard defined by OCP-IP and the configuration and control interfaces developed internally at TI.

2.1 Modules
The Figure 2 shows a simplified representation of the top level of an OMAP-2 SOC performance model. All the boxes are SystemC modules connected by SystemC channels. The modules fall into a small number of different categories: • Subsystems, which are just hierarchical divisions and contain further modules of the same sorts, connected in the same way. The hierarchy in general matches the hardware hierarchy of the SOC. Typically a subsystem comprises one or more processors and one or more DMAs or traffic generators. • Processors. Three different styles of processor model are used: – Stochastic, in which the processor generates random instructions, pretends to fetch them, then executes them. External memory accesses for fetch, load and store are filtered by stochastic cache models, in which the decision whether an access hits or misses is made through comparison of a random number with a cache miss ratio parameter. The power of such models is that with a very small number of parameters a representative bus activity can be created, even of the most complex software. Cache miss ratio and code profiling statistics are available for many classes of software and so a very large range of tests (protocol stack, signal processing, high-level-OS with user applications, games, etc.) can be run without significant software development or porting effort. Furthermore because no actual software is run, the parameters can be slightly degraded to test the sensitivity of the SOC to potential software variations. The models provide estimates of processor MIPS and simulated NOC/memory traffic.
36
CHAPTER 3
Figure 2. Generic example of SOC architecture model top-level assembly
Such models have been developed for RISC and DSP processors, Harvard and unified-memory, with L1 and L2 caches. Although they do not implement the function of processors, they may be said to be cycle-accurate. CPU and cache pipelines are modelled correctly, write buffers are implemented, and so on. – Trace-driven. Where the performance of the processor for a specific software is the primary consideration, a more detailed model is required which takes into consideration not only the statistics of the software but also the order of instructions executed. To achieve this, cycle-accurate processor and cache models are available, which replay a trace of the software execution rather than executing it afresh. This advantage of this is that software from another platform, for example a previous generation OMAP, may be tested without first being ported. Software porting, especially where OSs are involved, is a major task and is not attempted during the product definition phase. Furthermore, the use of traces, which include the effects of user I/O, provide repeatability – hard to achieve if the software is actually executed, for example in a game environment. – Instruction-set-simulators. Although used less than the other processor types, it is possible to instantiate a cycle-accurate ISS in the OMAP-2 performance simulation. This is at the moment restricted to DSPs. Such processor models are heavily used in DSP software optimisation, and instantiation in the SOC
USE OF SYSTEMC MODELLING IN CREATION AND USE OF AN SOC PLATFORM
37
model allows the effects of the overall system (for example increased latency caused by congestion on external memory) to be taken into consideration. The SOC model is not in general safe to use for the processor – there is no guarantee that memory exists where it should or that memory will not be overwritten by some random interference traffic, so the processor I/O is usually taken from the host filesystem and the external memory is fully cached inside the processor model. There is no requirement for an ISS model of the main RISC processor of the OMAP SOC. The cost of implementing a SystemC SOC model capable of supporting the interesting applications is too high, even before the cost of software porting and maintenance is considered. Configuration of the SOC, a task done by the RISC CPU in reality, is more easily accomplished in performance simulation by direct configuration of the modules (see below). • Memory controllers and memoriesThe memory controllers are modelled in a fully-cycle-accurate way. However they are not normally connected to memories. So a read operation will return data at the correct time, but it will not be the correct data. In general the whole OMAP-2 performance simulation platform can be described as dataless. • DMAs and peripherals. Similarly to the memory controllers, DMAs and peripherals are modelled cycleaccurately, but certain aspects of their functionality are not implemented. In the case of a DMA, it is the run-time programmability that is not present. Whereas in reality a processor writes to DMA registers in order to provoke a transfer, in the simulation the transfer is simply requested from the DMA model through a SystemC interface. This may be done at elaboration time, optionally with a delayed start, or at any time during the simulation by some other process (see below). Peripherals of interest to the performance simulation include serial port controllers, cryptographic accelerators and so on. 
A serial port controller model would be cycle-accurate on its bus and DMA/CPU interfaces but the serial data would not exist. Likewise a cryptographic accelerator would not encrypt the data given it, but it would act as though it had, making dummy data available at the correct time. In both cases any configuration (such as baud rate) would be done using a high-level SystemC interface and not by writing to simulated registers.
• Generic traffic generators. Many of the main bandwidth consumers in an OMAP SOC have relatively simple and repetitive traffic patterns. The best examples of this are the display and camera controllers. In the OMAP-2 performance model such things are represented by simple traffic generators. These generators have a range of addressing modes and traffic types (burst/non-burst, SRMD, etc.). They generate traffic at a constant rate (with optional jitter) and may have real-time requirements and internal pipelining limitations. By combining several such generators, relatively complex
CHAPTER 3
traffic flows may be created. They may also be configured to behave in a highly randomised way, to create a sort of background load in the system.
• NOCs. The networks-on-chip are the only fully cycle- and functional-accurate parts of the SOC model. The NOC technology used in OMAP-2 is based on generation of an NOC from a configuration file, which contains details of all the initiators and targets and the desired topology of the NOC. It is possible to generate both RTL code and SystemC code from the same input.
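As an illustration of the generic traffic generators listed above, the following C++ sketch produces a constant-rate stream of burst requests with optional jitter. It is not TI's implementation: the names TrafficGenConfig, BurstRequest and generate_traffic, the linear wrap-around addressing mode and the uniform jitter distribution are all assumptions made for this example, and SystemC and OCP are deliberately left out so that the sketch stays self-contained. Like the models in the text, it is dataless: only issue time, address and burst length are produced.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One burst request as seen on the initiator's bus interface (no payload data).
struct BurstRequest {
    std::uint64_t issue_cycle;  // cycle at which the request is issued
    std::uint64_t address;      // start address of the burst
    unsigned beats;             // number of data beats in the burst
};

// Hypothetical static configuration of a generator.
struct TrafficGenConfig {
    std::uint64_t base_addr;          // first address of the accessed region
    std::uint64_t addr_range;         // size of the region in bytes (must be > 0)
    unsigned burst_beats;             // beats per burst
    unsigned bytes_per_beat;          // bus width in bytes
    std::uint64_t period_cycles;      // nominal spacing between bursts
    std::uint64_t max_jitter_cycles;  // 0 disables jitter
    unsigned seed;                    // seed for reproducible jitter
};

// Generates 'count' bursts at a constant nominal rate. Each burst is delayed by
// a uniformly distributed jitter in [0, max_jitter_cycles]; addresses advance
// linearly and wrap around at the end of the configured region.
std::vector<BurstRequest> generate_traffic(const TrafficGenConfig& cfg,
                                           std::size_t count) {
    std::mt19937 rng(cfg.seed);
    std::uniform_int_distribution<std::uint64_t> jitter(0, cfg.max_jitter_cycles);
    const std::uint64_t burst_bytes =
        static_cast<std::uint64_t>(cfg.burst_beats) * cfg.bytes_per_beat;

    std::vector<BurstRequest> requests;
    requests.reserve(count);
    std::uint64_t offset = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t nominal = i * cfg.period_cycles;
        requests.push_back({nominal + jitter(rng), cfg.base_addr + offset,
                            cfg.burst_beats});
        offset = (offset + burst_bytes) % cfg.addr_range;
    }
    return requests;
}
```

Several such generators with different periods and regions could be combined to approximate a more complex flow; a large jitter relative to the period gives something closer to the randomised background load mentioned in the text.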
2.2 Interfaces and SystemC Channels
The modules described above all support the same basic set of SystemC interfaces. A small set of SystemC channels is used to connect them together.
• OCP TL1. The OCP-IP has proposed a methodology for SystemC modelling of OCP interfaces [2]. Documentation and SystemC code (interfaces, channels and data types) are available. The proposals cover a wide range of abstraction levels: TL1 being cycle-accurate; TL2 being protocol-specific with approximate timing, and so on. In the OMAP performance model the OCP TL1 technology is used exclusively. It can be argued that many of the simulations do not require cycle-accuracy and certainly many of the traffic generators or peripheral models are not in any way cycle-accurate in their functionality. However the advantages of having a single interface and a single channel to deal with outweigh the simulation speed gains that might be available in a mixed TL1/TL2 simulation platform. The OCP-TL1 channel includes a monitor interface, and a simple monitor which dumps a trace to a text file is available. A TI-developed statistics-gathering monitor is also used in the OMAP simulations. This allows bandwidths and latencies to be extracted as simulation outputs. Any OCP interface in the SOC may be monitored in this way. OCP is a synchronous protocol. There is a clock associated with every point-to-point OCP connection. In the SystemC model, the synchronisation is accomplished using sc_clock objects. All modules with OCP ports also have sc_clock input ports.
• Interrupts and DMA Requests. TI has developed a simple TL1 interface for DMA requests and interrupts, and a SystemC channel for connecting interrupt generators to interrupt consumers. The main point of interest in this technology is that a single channel is instantiated in the whole SOC simulation. This allows the routing of interrupts and DMA requests to be done at run time, based on a configuration file, rather than being hard-wired as in reality.
• Static Configuration.
All the modules and channels in an OMAP SOC performance model support an elaboration-time/end-of-elaboration configuration procedure. This is used for:
– Providing hardware parameters to generic modules, for example cache sizes, filenames for trace or executable binaries, FIFO depths, bus widths, clock frequencies.
– Providing modules with configuration information that would in the hardware be provided through register writes, for example baud rates, arbitration parameters, FIFO trigger thresholds.
– Providing behavioural parameters to autonomous initiators, for example cache miss ratios, display refresh rates and screen size, DMA transfer parameters.
Each module or channel has a set of parameters which may be set, and the parameter values are passed to it in the form of an STL map, templated with a pair of STL strings. The first string is the parameter name and the second includes a letter to indicate the type, and then the parameter value. Such maps may be read from text files created by a user in a text editor. For example:
    # part of a map file for a stochastic CPU model
    mpu.cpu_type s:arm1136
    mpu.cpu_ocp_clock_ratio i:3
    mpu.data_addr_range x:00300000
    mpu.data_base_addr x:05000000
    mpu.inst_base_addr x:07000000
    mpu.inst_miss_ratio f:0.02
• Run-time Control. Most of the simulations run on the OMAP performance model need only static configuration of the initiators in order to produce the desired behaviour. However in such simulations there is no interaction between the initiators. It is not possible for a process completing on the DSP to trigger the start of a DMA transfer, for example. In order to address this limitation, the modules (mainly the initiators) also support a second interface, which allows dynamic control of their behaviour during the simulation. Basically a secondary pure-functional simulation runs: it may start tasks on the OMAP initiators and is informed when these tasks complete. Preemption of tasks is possible, so the secondary simulation can model multiple tasks using the same CPU under control of an RTOS. This is illustrated in Figure 3.
In the figure the top half shows a representation (rather simplified) of a video conference application. It is purely functional, simply a chain of functions that have to be completed, each one triggering others. A complex application like this is hierarchical. In implementation, each function is mapped to some hardware, for example a DMA or a CPU. The curved black arrows show such a mapping. In our simulation technology there are two separate simulations within the same SystemC sc_main(). The pure-functional simulation provides the interactions between functions. For example it waits until the MPEG compression is complete before starting a DMA transfer of the compressed data to mass storage. The other simulation is the OMAP performance model as described above. Each function in the pure-functional simulation includes a set of parameters for one of the OMAP modules, on which it
Figure 3. Pure-functional simulation driving OMAP performance simulation
will be executed. The two simulations are linked by a set of schedulers, which allow multiple functions to be active at the same time even if they share the same CPU. For example several functions can run in turn on a stochastic CPU, with its own cache miss ratios and a number of instructions to execute before it is done. The interface implemented by the OMAP-2 initiator models in order to allow this dynamic control is:
    #include <map>
    #include <string>
    #include <systemc.h>

    // assumed here: the string-to-string STL map described under Static Configuration
    typedef std::map<std::string, std::string> config_map;

    class dynamic_configuration_if {
    public:
      virtual ~dynamic_configuration_if() {}
      // start a task on a processor
      virtual bool start_task(sc_event &done, int task_id, const config_map &cm) { return false; }
      // ask a processor to stop executing a task
      // so that a new one can be started
      virtual bool preempt_task(sc_event &ready, int task_id) { return false; }
      // find out the state of a running task
      virtual bool get_task_status(int task_id, config_map &cm) { return false; }
    };
Use cases (applications) defined in this way can also be executed standalone, and can very easily be re-used from a model of one SOC to a model of another, without requiring the same function-to-hardware mapping.
3. INTERACTION BETWEEN OMAP-2 PERFORMANCE MODELLING AND OTHER OMAP-2 SIMULATION PLATFORMS
The architecture-level SOC performance model described here is not the only simulation model of an OMAP-2 SOC that is created. All of the following models are available:
• SystemC architecture-level performance model.
• Virtual platform model for software development.
• RTL simulator, including the options to substitute fast ISSs or simple traffic generators for the processors.
• FPGA model.
The different models serve different purposes, require different levels of effort to use, and become available at different times during the project. The SystemC performance model is always available first and is always the simplest to create and use. The virtual platform is the next to become available. It is used for software development and has very little timing accuracy. TI uses Virtio technology to create this model rather than SystemC [5]. The lack of accurate timing in this model means that low-level software has to be validated on another platform and this is the motivation behind the FPGA model. The FPGA model can also be used for
performance investigations. It complements the SystemC model, being much less flexible and requiring software, but having a degree of completeness and accuracy that is not attempted in SystemC. RTL simulations are in general too slow for either software development or performance investigations, but are the final reference in cases of doubt, and have the advantage of complete visibility into the SOC behaviour. It might appear that the choice of two different technologies for the virtual platform and the performance model is inefficient, wasting potential code re-use. However the two have completely different (almost fully orthogonal) requirements and at module level there is almost no code re-use possible. This is illustrated in Figures 4 and 5. Figure 4 shows a breakdown of a module into different aspects, whose importance varies depending on the level of abstraction. This example is an OCP slave,
Figure 4. Aspects of a typical module to be modeled
Figure 5. Architect’s view and programmer’s view of the module
a peripheral of some kind, with a register file, some functionality triggered by writes to the registers, and some timing associated with execution of the function. Furthermore the peripheral has a bus interface which is compliant with some protocol, in this case OCP. A complete model of the peripheral would implement all this. And a generic model architecture following this breakdown has often been proposed in the ESL industry. Figure 5 shows that the model needed for the SOC architect’s performance analysis is completely orthogonal to that needed by the software developer’s virtual platform. On the left we see that the architect needs the timing and the bus interface. The architect is concerned that the parameters of the bus interface are correctly chosen such that the function can be implemented, and needs to be able to look at the cycle-by-cycle behaviour on that interface. But the functionality is of no interest to the architect. Suppose this is a cryptographic accelerator: it doesn’t matter to the architect whether the encryption is done properly or not. In fact it may be a hindrance if apparently random data is visible on the bus, making it much more difficult to correlate input and output. Furthermore, as discussed above, the architect will want to be able to trigger the function without writing into the registers via the bus, because that would require writing or modifying software, a time-consuming activity which may not even be possible. On the right, by contrast, we see that in the virtual platform only the encryption function and the registers are important. The software engineer does not care at all about the bus interface and the virtual platform generally discards timing to improve simulation speed. For NOCs and memory controllers, the difference between the architect’s view and the programmer’s view is yet starker. 
These modules are the central core of the architect’s model, but do not even exist in the programmer’s view, except maybe as a few register stubs, their main functionality being fully software-transparent.
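The contrast between the two views can be made concrete with a small C++ sketch of the cryptographic accelerator example. Everything here is hypothetical and chosen only for illustration: the class names, the register offsets, the linear timing formula and the XOR operation standing in for a real cipher are assumptions, not part of the OMAP-2 models. The architect's model keeps timing and is triggered directly, without registers or data; the programmer's model keeps registers and function and has no notion of time.

```cpp
#include <cstdint>

// Architect's view (hypothetical): timing only, dataless, no registers.
struct ArchitectCryptoModel {
    std::uint64_t setup_cycles;      // fixed start-up cost per request
    std::uint64_t cycles_per_block;  // processing time per data block

    // Triggered directly by the test bench rather than by register writes.
    // Returns the cycle at which (dummy) output data becomes available.
    std::uint64_t start(std::uint64_t now, std::uint64_t blocks) const {
        return now + setup_cycles + blocks * cycles_per_block;
    }
};

// Programmer's view (hypothetical): registers and function, no timing.
class ProgrammerCryptoModel {
public:
    enum Register : std::uint32_t {
        KEY = 0x0, DATA_IN = 0x4, DATA_OUT = 0x8, CTRL = 0xC
    };

    ProgrammerCryptoModel() : key_(0), data_in_(0), data_out_(0) {}

    // Register write as issued by the software under development.
    void write(std::uint32_t offset, std::uint32_t value) {
        switch (offset) {
            case KEY:     key_ = value; break;
            case DATA_IN: data_in_ = value; break;
            case CTRL:
                // Bit 0 starts the operation; the result appears immediately.
                // XOR is only a placeholder for a real encryption function.
                if (value & 1u) data_out_ = data_in_ ^ key_;
                break;
            default: break;  // writes to other offsets are ignored
        }
    }

    // Register read; only DATA_OUT is readable in this sketch.
    std::uint32_t read(std::uint32_t offset) const {
        return offset == DATA_OUT ? data_out_ : 0;
    }

private:
    std::uint32_t key_;
    std::uint32_t data_in_;
    std::uint32_t data_out_;
};
```

The two classes share no members at all, which is one way to see why so little code can be re-used between the two kinds of model at module level.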
3.1 Usage Example of Performance Model
The performance model exists because a generic and rapid-deployment performance model is essential for a platform-based SOC factory, as discussed above. But more than this, there are certain things that cannot be achieved without this kind of technology. Here an example is presented, in which the performance limits of an OMAP-2 SOC have been probed using the model. The use case is a videoconference. This is easy to say, but when it comes to the details a great many choices need to be made. Development of software able to implement all these choices as run-time or even compile-time options is practically impossible. On FPGA or silicon each videoconference to be analysed requires development resources. On the SystemC performance model, on the other hand, a regression of 144 videoconferences has been created that can easily be applied to any OMAP application processor. The results of this are available to OMAP marketing for understanding the limits of each platform in advance of specific customer queries.
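One plausible way to build such a regression ensemble is as a cross product over independent option axes, each scenario then being turned into a set of static configuration maps. The C++ sketch below shows only the enumeration step. The text does not say how the 144 scenarios are actually composed, so the axis names, the placeholder values "A" to "D" and the 4 x 3 x 3 x 2 x 2 breakdown are invented purely so that the product comes to 144; enumerate_scenarios and example_axes are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One independent option axis of the use case, with its allowed values.
struct Axis {
    std::string name;
    std::vector<std::string> values;
};

// A scenario is one "name=value" choice per axis, in axis order.
typedef std::vector<std::string> Scenario;

// Enumerates the full cross product of all axes.
std::vector<Scenario> enumerate_scenarios(const std::vector<Axis>& axes) {
    std::vector<Scenario> result(1);  // start from a single empty scenario
    for (const Axis& axis : axes) {
        std::vector<Scenario> next;
        next.reserve(result.size() * axis.values.size());
        for (const Scenario& partial : result) {
            for (const std::string& value : axis.values) {
                Scenario extended = partial;
                extended.push_back(axis.name + "=" + value);
                next.push_back(extended);
            }
        }
        result.swap(next);
    }
    return result;
}

// Invented axes for illustration only: 4 * 3 * 3 * 2 * 2 = 144 scenarios.
std::vector<Axis> example_axes() {
    return {
        {"display", {"A", "B", "C", "D"}},
        {"window_layout", {"A", "B", "C"}},
        {"codec", {"A", "B", "C"}},
        {"mapping", {"A", "B"}},
        {"memory", {"A", "B"}},
    };
}
```

Because every scenario is just data, adding an option to one axis multiplies the size of the regression without any new software having to be written, which is the advantage over FPGA or silicon claimed in the text.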
Figure 6. Example results, from simulations of an ensemble of 144 videoconference scenarios
Some of the parameters that are varied in order to create the 144 scenarios include:
• Display size, refresh rate and orientation.
• Configuration of windows on the display, including re-scaling and rotation requirements for the video.
• Size of the base image used in the videoconference.
• Compression algorithm used (MPEG or other, with stabilisation or without, and so on).
• Mapping of videoconference functional elements to OMAP hardware.
• Configuration of SOC, including external memory size and performance, arbitration options, burst usage, clock frequencies.
Figure 6 shows some of the results generated. These are bandwidths measured on the external memories. Similar bar charts exist for latencies, CPU occupancies, FIFO occupancies for hard-real-time functions, and so on.
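Bandwidth and latency figures of this kind can be derived from a per-interface transaction trace such as the one a monitor might record. The following C++ sketch assumes a hypothetical trace record holding the request cycle, the response cycle and the byte count of each transfer; it is not the TI statistics-gathering monitor, and the names TransferRecord, InterfaceStats, summarise and to_megabytes_per_second are invented for the example.

```cpp
#include <cstdint>
#include <vector>

// One completed transfer observed on a monitored interface (hypothetical).
struct TransferRecord {
    std::uint64_t request_cycle;   // cycle at which the request was accepted
    std::uint64_t response_cycle;  // cycle at which the last response arrived
    std::uint32_t bytes;           // payload size of the transfer
};

struct InterfaceStats {
    double bytes_per_cycle;        // average bandwidth over the window
    double mean_latency_cycles;    // average request-to-response latency
    std::uint64_t max_latency_cycles;
};

// Summarises a trace over an observation window of 'window_cycles' cycles.
// Returns all-zero statistics for an empty trace or an empty window.
InterfaceStats summarise(const std::vector<TransferRecord>& trace,
                         std::uint64_t window_cycles) {
    InterfaceStats stats = {0.0, 0.0, 0};
    if (trace.empty() || window_cycles == 0) return stats;

    std::uint64_t total_bytes = 0;
    std::uint64_t total_latency = 0;
    for (const TransferRecord& r : trace) {
        const std::uint64_t latency = r.response_cycle - r.request_cycle;
        total_bytes += r.bytes;
        total_latency += latency;
        if (latency > stats.max_latency_cycles) stats.max_latency_cycles = latency;
    }
    stats.bytes_per_cycle =
        static_cast<double>(total_bytes) / static_cast<double>(window_cycles);
    stats.mean_latency_cycles =
        static_cast<double>(total_latency) / static_cast<double>(trace.size());
    return stats;
}

// Converts bytes per cycle into megabytes per second for a given clock.
double to_megabytes_per_second(double bytes_per_cycle, double clock_hz) {
    return bytes_per_cycle * clock_hz / 1.0e6;
}
```

Running the same summary on every monitored interface of every scenario would give one bar per scenario for a chart of the kind shown in Figure 6.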
4. FUTURE DIRECTIONS AND THE NEED FOR STANDARDS
The above sections have described the OMAP-2 platform and its SystemC-based performance modelling infrastructure. This infrastructure is one of the technologies essential if real benefits are to be drawn from a platform-based SOC methodology. The emphasis in this paper has been more on the platform-user’s requirements and workflow than on those of the platform-supplier. Within TI’s OMAP organisation the distinction between platform-user and platform-supplier is relatively small and most of the issues raised apply to both.
The use of SystemC for performance modelling must fit into a broader methodology for SOC definition and development. Electronic System Level [Design], or ESL, is generally used as an umbrella term for this kind of thing. ESL encompasses themes as diverse as synthesis from sequential C code to RTL and virtual platform use for early software development. Within TI a number of tools and technologies have been or are being adopted and SystemC is seen as a part of the overall ESL puzzle rather than as a central uniting theme. Non-SystemC ESL activity includes use of executable specifications, requirements and use case capture, top-level SOC integration automation, memory-map and register-map capture. The performance analysis modelling environment must interwork with these tools and therefore it is important that the SystemC TLM technology not restrict itself to a SystemC-for-everything worldview. It is not expected that the OMAP performance model will be used to generate RTL. Rather it is expected that the same tool will generate the top-level RTL and the performance model. Work on the modelling platform is continuing but in some areas there is a strong desire for public standards, to replace the ad-hoc technology developed within TI, making it cleaner, available to third-party suppliers, and supportable by EDA vendors. Also there needs to be widespread agreement on the types of model that are required. So far it seems that the types of model used in OMAP-2 performance modelling have not been widely proposed.
4.1 Author’s Opinion on Desired New Standards
• SystemC bus interfaces (TLM). The existing TLM (v1) standard from OSCI is attractive as technology but wholly inadequate as a way of ensuring compatibility between models. On the other hand the TLM technology developed by OCP-IP meets this requirement (nearly) but is limited in applicability, providing only the OCP protocol, and is somewhat different in concept from OSCI TLM. As discussed above, the abstraction level of an interface does not necessarily match the abstraction level of the simulation or module. Therefore it is inappropriate to name the interface levels programmer’s view, architect’s view, and so on. Names should rather be chosen reflecting their potential accuracy. The author would like to see OSCI provide the following:
– Base technology for implementing cycle-based bus interfaces at an accuracy equivalent to RTL. The technology should include features such as:
◦ elaboration-time configuration.
◦ distribution of timing information to permit combinatorial paths within TLM.
◦ ability to specify preemptive-accept without additional processes.
◦ generic monitoring interfaces.
– Base technology for implementing event-driven bus interfaces. This would be for asynchronous protocols at RTL-accuracy or cycle-based protocols with approximate timing. The same features are required as for a cycle-based interface except that distribution of timing information and preemptive-accept
are not required, but definition of how to timing-annotate the interface or the data types is.
– Guidelines for making protocol-specific versions of the base technologies, which requires:
◦ how to divide a bus protocol into phases.
◦ how to define data types for those phases (note: use of C++ templates is a failure to address this issue, passing it on to the end user).
◦ how to integrate multiple phases into a single interface and channel.
– Technology for generic (non-protocol-specific) timed bus interfaces. Because this would not be protocol-specific it would be completely defined by OSCI, including all required data types, and would not need any guidelines for extensions.
– Technology for generic untimed bus interfaces, for maximum speed pure-functional simulations. Because this would not be protocol-specific it would be completely defined by OSCI, including all required data types, and would not need any guidelines for extensions.
• Parameter interfaces for modules. Standard interfaces are required for passing configuration information to TLM modules, and recovering information from them for debug or performance information for simulation results. As described above, there are different types of information which may need different interfaces:
– For generic modules, hardware-configuration parameters, such as cache sizes, filenames for memory pre-load, FIFO depths, bus widths, clock frequencies and so on.
– Static parameters, including both register backdoors and activity initiation.
– Dynamic behaviour control (task start, interrupt, status).
– Debug, status and statistics information coming from the module.
• Integration guidelines for SystemC models and other ESL and SOC development tools. In the context of a platform-based SOC methodology, it is attractive to package the platform modules in a form that allows automatic extraction of different views.
For example a top-level integration tool would extract bus interface parameters for modules in order to connect them to the NOC. A power-estimation tool would take the top-level description from the top-level tool, use-case descriptions from the use-case database and module power consumption information or models from the platform library. The different models (architecture performance estimation and virtual platform) would be constructed automatically from the top-level integration information and existing SystemC models within the library. Standards are required for packaging of modules which can include different models, which in turn requires that the different types of model be well-defined and use standard interfaces only. The most likely home for standards of this sort is within SPIRIT, but alignment with OCP-IP is essential.
• A standard language for functional description of applications, for use in SOC use case capture and documentation. UML has been proposed as a basis for such a standard.
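To make the parameter-interface item above more concrete, the typed string maps of Section 2.2 (one line per parameter, a name followed by a type letter, a colon and the value) could be read and queried along the following lines. This is a minimal sketch and not the TI code: parse_config, value_of and the get_* accessors are hypothetical names, and the meaning of the type letters (s for string, i for integer, x for hexadecimal, f for floating point) is inferred from the single example in the text.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Parameter name -> "<type letter>:<value>", as in the map-file example.
typedef std::map<std::string, std::string> config_map;

// Reads "name type:value" lines; blank lines and '#' comment lines are skipped.
config_map parse_config(std::istream& in) {
    config_map cm;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name, typed_value;
        if (!(fields >> name) || name[0] == '#') continue;
        if (!(fields >> typed_value))
            throw std::runtime_error("missing value for " + name);
        cm[name] = typed_value;
    }
    return cm;
}

// Returns the text after "<type>:"; throws if the parameter is missing,
// empty, or stored with a different type letter.
std::string value_of(const config_map& cm, const std::string& name, char type) {
    config_map::const_iterator it = cm.find(name);
    if (it == cm.end()) throw std::runtime_error("unknown parameter " + name);
    const std::string& tv = it->second;
    if (tv.size() < 3 || tv[0] != type || tv[1] != ':')
        throw std::runtime_error("wrong type for parameter " + name);
    return tv.substr(2);
}

std::string get_string(const config_map& cm, const std::string& name) {
    return value_of(cm, name, 's');
}

long get_int(const config_map& cm, const std::string& name) {
    return std::stol(value_of(cm, name, 'i'));
}

std::uint64_t get_hex(const config_map& cm, const std::string& name) {
    return std::stoull(value_of(cm, name, 'x'), nullptr, 16);
}

double get_float(const config_map& cm, const std::string& name) {
    return std::stod(value_of(cm, name, 'f'));
}
```

Keeping the type letter inside the value string means the map itself stays a plain map of strings, which is what allows one and the same container to serve hardware parameters, register backdoors and behavioural parameters alike.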
REFERENCES
[1] OCP Protocol, www.ocpip.org
[2] OCP TLM Modelling, www.ocpip.org/data/ocpip_wp_SystemC_Communication_Modeling_2002.pdf
[3] SystemC, www.systemc.org
[4] OMAP Platform, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&path=templatedata/cm/general/data/wtbovrvw/omap
[5] OMAP Code Development Tools, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=623&navigtionId=12013&path=templatedata/cm/general/data/wtbmiddl/omap_development
CHAPTER 4 WHAT’S NEXT FOR TRANSACTION LEVEL MODELS: THE STANDARDIZATION AND DEPLOYMENT ERA
LAURENT MAILLET-CONTOZ STMicroelectronics
This article reports on the experience acquired at ST over the past five years in the Transaction Level Modeling area. It presents the TLM approach as the key to addressing new challenges for state-of-the-art Systems on Chip. After introducing the requirements, use models and concepts, we detail the implementation of the modeling approach followed by ST over the past four years. Drawing on this experience, we describe the need to extend the existing standards and tools to cover TLM needs, in terms of modeling concepts as well as platform automation.
1. INTRODUCTION
Systems on Chip are becoming more and more complex. The current trend, on top of Moore’s law, is to integrate more and more functionality. Not only is more and more hardware-implemented functionality available, but more and more firmware and application software is also integrated onto SoCs. At the same time, expectations keep rising in terms of time-to-market and reliability. In our view, the only way to address all these requirements is to raise the level of abstraction to cope with state-of-the-art SoCs, and to provide new modeling methodologies, design methods and tools. As a result, system level design has been of major importance for several years. Over the last couple of years, semiconductor companies have been developing in-house methodologies to support advanced modeling flows at the transaction level (Transaction Level Modeling, TLM). The range of CAD tools supporting SystemC and TLM constructs is wider than ever, but is still inconsistent. However, this approach will not be widely deployed and adopted unless models developed in different teams or companies can be assembled together to create system models. This calls, on one hand, for standard communication mechanisms to be identified and shared throughout the industry, and on the other hand, for tools that increase engineer productivity in model development and integration.
This article is organized as follows. In the next section, we position TLM as the new level of abstraction required to cope with System Level Design issues. Then, we present the modeling approach adopted within ST at the transactional level. From this experience, we identify in the second part the need for standard communication mechanisms and tools.
M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 49–57. © 2006 Springer.
2. TLM: A NEW LEVEL OF ABSTRACTION TO ADDRESS SoC COMPLEXITY
2.1 User Requirements for High Complexity SoCs
To address the challenges listed above, new modeling techniques must be provided to system simulation users. Among a long list of expectations, four requirements are mandatory:
1. Simulation speed: high level models must provide very efficient simulation speed (100x to 1000x improvement as a rule of thumb) to support interactive activities like embedded software debug.
2. Model accuracy: models must be accurate enough to support high level activities. It is often said that models need to be cycle accurate to support all activities. In our experience this is not a strong requirement, as several activities can be carried out without timing accuracy. For example, embedded software development can be pursued with functional models, even if timed models may be required to optimize the software implementation.
3. Lightweight modeling: for new techniques to be adopted, the approach must be lightweight, so as to reduce the effort needed to get new models up and running.
4. Early availability: to enable activities ahead of the availability of RTL models, high level models must be available early. A lead of six months is very convenient.
2.2 Use Models
When users get SoC simulation platforms well ahead of RTL models, and far ahead of a prototype, they can proceed with these three key activities:
• Functional verification. Reliability of SoCs is obviously a key factor in various application areas, and is required to meet the time-to-volume challenge. Therefore, improving the functional verification process, either by reducing the verification time or by extending the verification coverage, is a strategic priority. Moreover, a unified verification environment across levels of abstraction (RTL, TLM) improves the process by reusing the same test bench on both the TLM and the RTL models.
• Embedded software development. To improve the overall design cycle, early embedded software development enables real hardware/software co-design. The gain is twofold. First, early software development helps to identify misunderstandings of the specification between software and hardware teams, since both share the same executable specification. Second, software developers are able to develop and debug their embedded code on a simulation platform that is easier to instrument than the real hardware. This platform can be delivered to each and every developer, and can be used for debug purposes even when the actual chip is available.
• System analysis. In an early phase of the design process, it is also very useful to investigate different functional scenarios.
2.3 Initial Experiments Towards the Definition of TLM
Since 2000, ST has experimented with various modeling techniques to address the requirements above:
• Cycle-accurate models: With the emergence of C-based dialects that support hardware concepts, it seemed plausible that cycle-accurate models developed in a C-based environment could meet Requirement 2 mentioned earlier while raising the abstraction level. However, this hypothesis stumbled upon obstacles: most of the information captured by cycle-accurate models is unavailable in IP documentation, and cycle-accurate models simulate merely an order of magnitude faster than the equivalent RTL models.
• Co-verification platforms: Despite the numerous benefits of co-verification, waiting for RTL hardware models to be developed before co-verification can be conducted takes too long. These platforms therefore do not address requirement 4 (nor, to a certain extent, requirement 1).
• Performance models: This kind of model is chosen mainly for the performance analysis of a system. Timing analysis is the focus of such temporal models, and analytical accuracy is forgone. Some effort was put into the development of a temporal model. The resulting model provided extremely high simulation speed, but with little or virtually no functional accuracy guaranteed. These models are well suited for requirements 1, 3 and 4, but they are not detailed enough to address requirement 2 (notably for software developers).
As shown above, these various experiments do not address all the listed requirements. We have therefore developed a new modeling methodology to cope with them, the so-called Transaction Level Modeling methodology. It is described in the next section.
3. TRANSACTION LEVEL MODELING AT ST
3.1 A Pragmatic Approach for TLM
ST developed a full set of modeling classes and APIs between 2000 and 2003 to address the requirements and user needs listed in the previous section. Having gained experience and know-how with other techniques, we investigated high
level modeling approaches. Rather than defining yet another programming language, we tried to identify the minimal set of modeling primitives required to support the three activities listed above. We came up with a working solution based on SystemC, which has the following advantages:
• Open source modeling language and simulation kernel.
• Easy sharing of models.
• Rapid learning curve for newcomers in model development.
We detail the implementation choices of the approach in the following sections.
3.2 A Single SoC/IP Reference Model
In our modeling approach, we have separated behavior and communication. Following the same principle, we have also kept functionality and timing separate. Untimed models are located at the so-called Programmer’s View (PV) level. Models aligned with this concept mainly address:
• Functional verification.
• Development of functional embedded software, without timing optimizations.
• Co-verification with RTL models.
Adding timing to the models provides Programmer’s View + Timing (PVT) models. They are well suited for:
• System performance analysis.
• Fine tuning of embedded software.
Having a single reference model enables:
• Rationalizing modeling efforts: the proliferation of different models to address complementary (and often overlapping) needs is a very common issue in the industry. It is indeed very difficult in this context to identify only one model that covers all the expectations. Therefore, having a single model built by a central organization that covers all the needs of the various activities saves a lot of effort.
• Consistency between developments: Project teams work concurrently on various tasks to build the overall functionality. It is obviously desirable to make sure that all the developments will integrate smoothly by the end of the project.
• Communication between teams: Sharing a single model also enables a common executable specification to be shared by the developers. This avoids incompatible interpretations of the initial specification, and gives the contributors to the project the opportunity to discuss requirements, respective assumptions and expectations together.
3.3 In House TLM Protocol and IP Portfolio
In the very first implementation, we relied on the SystemC 2.0.1 classes and developed our own modeling classes. Following a layered approach, we implemented the very low-level mechanism to exchange transactions between two modules, and
then built on top of that a set of convenience functions to provide a user-friendly API for an abstract bus model. After we donated this first implementation to the OSCI consortium, a TLM standard became available. The current implementation is aligned with the OSCI TLM 1.0 standard and SystemC 2.1. Basically, we have restructured the code to define a protocol with appropriate communication data structures. The communication is based on the transport function as defined in the tlm_transport_if core TLM interface. Request and response data structures convey the actual transaction payload. On top of that we have also developed a convenience layer to ease the use of the communication API, which provides the user with a set of functions. Building on these classes, we have developed a full set of IP models. Some are general-purpose, such as bus or memory models. Others are specific to an application area, like video codecs. We have also wrapped the main ST and external processor models to support software development.
4. TOWARDS STANDARD MECHANISMS FOR TLM MODELS
4.1 OSCI TLM Standard
SystemC was conceived as an open modeling language. Although it is not really a language, since it consists of a collection of modeling classes on top of C++, it has already proven its value as the reference language for system level models. As such, SystemC provides a simulation kernel and a set of basic classes, like signals and channels. In its 2.0.1 version, it permits advanced users to implement their own TLM mechanisms, but does not provide system level constructs that could be shared within the semiconductor industry. Therefore, the OSCI consortium has been defining a TLM standard over the last two years. This standard focuses on the low-level mechanisms that make models communicate. Several interfaces have been defined:
• Bidirectional blocking interface: the transport function of the tlm_transport_if interface.
• Unidirectional interfaces: the put and get functions of tlm_blocking_put_if and tlm_blocking_get_if, for blocking interfaces. Equivalent interfaces are defined for nonblocking communication.
However, defining a standard at this level is not enough to ensure model interoperability.
4.2 IP Interoperability
Interoperability at the TLM level covers a set of different expectations. Below is a set of requirements to be addressed by a TLM standard at the PV level:
• Execute eSW functionally correctly up to application level (speed).
– Register/bit accuracy.
– Functional accuracy.
• Enable system architecture analysis (dataflow, deadlocks, etc.).
– Model communication (data, synchro) between IPs of the system.
– No timing involved.
– Modeling of system synchronizations.
• Insulate architecture (and therefore eSW) from its implementation.
• Detect micro-architecture assumptions of the system.
As such, an extension of the OSCI TLM 1.0 standard is required to support all these features. It should address the minimal set of attributes to be shared as a generic PV interface, i.e. those common to all buses or interconnects, and provide an extension mechanism to cover extra features specific to a given context. The same should apply at the PV+T level. User-level convenience functions, however, should not be part of the standard; the standard should instead rely on the definition of common data structures. A typical transaction payload should contain address, data and byte enable information. It is also desirable to include metadata and a configuration mechanism, as well as debug accesses. The advantage of including these features in the protocol is that a single protocol serves all of these purposes, removing the need for dedicated configuration and debug protocols or interfaces. This should simplify the overall assembly of the platform by reducing the number of ports and different protocols it uses.
4.3 Platform Assembly
Platform assembly is also key to the deployment of the TLM approach. Platforms are becoming more and more complex and integrate an ever-growing number of models. As a consequence, the models come from various sources within the same company and, even more frequently, from various external providers. Obviously, all the models should rely on the same TLM interface to ensure seamless integration. This is the objective of the TLM standard. It is, however, not enough to support platform automation. Given the complexity of the platforms, it becomes crucial to support automated platform assembly using graphical tools. The platform assembly tools should be able to instantiate and bind models coming from different sources. A standard to describe IP and subsystem interfaces, at various levels of abstraction, is therefore required. SPIRIT is today the best candidate to address this need.
5. TOOLS FOR TLM DEPLOYMENT
Defining standards is obviously the first step towards wide deployment. Unfortunately, it is not enough. Standards will define data structures, APIs, and the semantics of the transaction payload. They will not directly increase user productivity, but will enable EDA companies to provide tools that support models developed on top of the standards. These tools should provide support for model development and optimization, for platform integration and debugging, and for platform verification. Let us now detail these different categories.
5.1 Model Development and Optimization
Model developers currently feel alone in front of their text editors. While advanced users may reuse existing models as code templates, newcomers need assistance in developing models. This can be provided either through model libraries for commonly used models, like memory models, or through tools that make the developer’s life easier. Such tools range from integrated development environments and model creation wizards to model analyzers. In this category, model developers also want checkers for coding guidelines, SystemC/TLM debuggers and profilers. As the standards rely heavily on C++ template features, developers will also need more user-friendly interfaces to compilers and linkers (e.g. more readable error messages in case of erroneous usage of polymorphism). Model developers will clearly also need SystemC profilers to analyze the performance of their models. On the one hand, they will check that the code they have written is efficient, regardless of the simulation environment. On the other hand, they will assess whether the models are in line with a specific model of computation, e.g. to reduce the number of wait statements executed during the simulation and thus improve the overall simulation speed.
5.2 Platform Integration and Debug
Once standards are defined, models become available from various sources and are integrated in a single platform. On the one hand, this will hopefully increase the number of available models. On the other hand, it may raise many issues related to model interoperability, such as:
• Incompatible data/address types for port binding.
• Erroneous system synchronizations.
• Incomplete functional models.
Therefore, an integrated environment that provides platform integration capabilities and graphical visualization of the hierarchical subsystems instantiated in the platform will be key to wide adoption. In this cockpit, one should be able to plug in model viewers, netlisters, a compilation framework and platform debug tools. Among the large set of debuggers, one can also list new features such as a message sequence chart viewer to understand dependencies between processes, and advanced simulation kernels that support randomization of process selection or integration of profiler information.
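The first interoperability issue listed in this section, incompatible data/address types at port binding, is one that can in principle be caught at compile time and reported in readable form. The following is a minimal sketch in plain C++ with no SystemC dependency. The names TargetIf, InitiatorPort, is_bindable and IncrementTarget are hypothetical illustrations and are not taken from any TLM standard or tool.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical target interface: one blocking read, parameterized on the
// address and data types carried by the protocol.
template <typename Addr, typename Data>
struct TargetIf {
    virtual ~TargetIf() = default;
    virtual Data read(Addr address) = 0;
};

// Compile-time compatibility predicate that a checker or an assembly tool
// could query before generating the binding code.
template <typename PortAddr, typename PortData, typename TgtAddr, typename TgtData>
struct is_bindable
    : std::integral_constant<bool,
                             std::is_same<PortAddr, TgtAddr>::value &&
                             std::is_same<PortData, TgtData>::value> {};

// Hypothetical initiator port. Binding to a target with different types
// triggers the static_assert and its one-line message.
template <typename Addr, typename Data>
class InitiatorPort {
public:
    template <typename TgtAddr, typename TgtData>
    void bind(TargetIf<TgtAddr, TgtData>& target) {
        static_assert(is_bindable<Addr, Data, TgtAddr, TgtData>::value,
                      "port binding error: initiator and target address/data types differ");
        target_ = &target;
    }

    Data read(Addr address) {
        assert(target_ != nullptr);  // the port must be bound before use
        return target_->read(address);
    }

private:
    TargetIf<Addr, Data>* target_ = nullptr;
};

// Toy target used only to exercise the binding: returns address + 1.
struct IncrementTarget : TargetIf<std::uint32_t, std::uint32_t> {
    std::uint32_t read(std::uint32_t address) override { return address + 1; }
};
```

With this arrangement, an attempt to bind a 32-bit initiator port to a target declared with 64-bit addresses is rejected during compilation, and the static_assert text appears among the diagnostics.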
5.3 Platform Verification
Platform verification is a very wide topic, in the sense that it includes:
• Safe integration of the various components and the associated embedded software.
• Functional verification.
• Performance verification.
Therefore, to enable deployment, tools should support all of the above. Safe integration of the components should be supported by platform assembly tools relying on the SPIRIT standard. The management of the associated embedded software remains an open issue today. Functional verification can be achieved by different means, such as trace comparison, runtime scoreboarding, etc. A major topic to address with ESL tools is the validation of TLM platforms against the “equivalent” RTL platforms. Today, this is mainly done by comparing the memory contents at the end of the simulation. This requires exercising both platforms with the same stimuli, and having good confidence in the coverage of the test vectors. Even at the TLM level, simulations of complex systems may take a long time, so exhaustive coverage cannot be fully achieved. This calls for new techniques and tools for model and platform functional validation. Among these techniques, more formal methods could help identify the fundamental differences between the TLM and RTL views of the same IP. Research is under way in this area, through either formal verification or runtime verification techniques; tools should be derived from these activities. Ensuring the “functional equivalence” of the TLM and RTL platforms is obviously required for a valid verification flow used in production. However, this is only the first step towards a more comprehensive verification environment in which performance verification also takes place. Indeed, it is also desirable to have tools that help verification engineers understand whether the platform they have modeled will meet the real-time constraints of the application.
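The end-of-simulation memory comparison between TLM and RTL platforms mentioned in this section can be illustrated with a small sketch. It assumes that both simulations dump the same memory region as a byte vector of equal length. The type Mismatch, the function compare_memory_images and the max_reports parameter are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One differing byte between the two end-of-simulation memory images.
struct Mismatch {
    std::size_t address;     // byte offset within the dumped region
    std::uint8_t tlm_value;  // value found in the TLM platform dump
    std::uint8_t rtl_value;  // value found in the RTL platform dump
};

// Compare two memory dumps byte by byte and return the first
// `max_reports` differences, in increasing address order.
std::vector<Mismatch> compare_memory_images(const std::vector<std::uint8_t>& tlm_image,
                                            const std::vector<std::uint8_t>& rtl_image,
                                            std::size_t max_reports = 16) {
    // Both platforms are assumed to dump the same address range.
    assert(tlm_image.size() == rtl_image.size());

    std::vector<Mismatch> mismatches;
    for (std::size_t addr = 0;
         addr < tlm_image.size() && mismatches.size() < max_reports; ++addr) {
        if (tlm_image[addr] != rtl_image[addr]) {
            mismatches.push_back({addr, tlm_image[addr], rtl_image[addr]});
        }
    }
    return mismatches;
}
```

An empty result only shows that the final memory states agree for the stimuli that were applied; as the text notes, it says nothing about behavior the test vectors did not exercise.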
On top of functional (PV) models, IP providers should also provide performance (PV+T) models. EDA companies should provide tools to support system performance analysis and to investigate various timing scenarios.
6. CONCLUSION
In this article, we have described the Transaction Level Modeling approach as the way to address the increasing complexity of Systems-on-Chip. We have reported on the experience acquired over the past years at ST, and detailed the rationale for standards at the TLM level. In our view, the first item to be standardized is a PV protocol, as an extension of the OSCI TLM standard. It should support transaction level communication between initiators and targets, and carry address, data and byte enable information. In the second part of the paper, we focused on the tools required to increase the productivity of model developers, platform integrators and verification engineers.
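To make the role of byte enables in such a PV payload concrete, here is a minimal sketch of a 32-bit write request and of the way a target could merge it into a stored word. The struct PvWriteRequest, the function apply_write and the convention that bit i of byte_enable selects byte lane i are assumptions made for this illustration; they are not defined by the OSCI TLM standard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PV write payload: address, data and one enable bit per byte lane.
struct PvWriteRequest {
    std::uint32_t address;
    std::uint32_t data;
    std::uint8_t byte_enable;  // assumption: bit i enables byte lane i (bits 8*i .. 8*i+7)
};

// Merge the enabled byte lanes of the request into the word currently
// stored at the target; lanes that are not enabled keep their old value.
inline std::uint32_t apply_write(std::uint32_t old_word, const PvWriteRequest& request) {
    std::uint32_t result = old_word;
    for (int lane = 0; lane < 4; ++lane) {
        if (request.byte_enable & (1u << lane)) {
            const std::uint32_t mask = 0xFFu << (8 * lane);
            result = (result & ~mask) | (request.data & mask);
        }
    }
    return result;
}
```

For example, writing data 0xAABBCCDD with byte_enable 0x5 over a stored word 0x11223344 updates lanes 0 and 2 only and gives 0x11BB33DD.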
Among the large variety of tools required by these activities, one can list TLM debuggers, profilers, linters, and pretty printers for compilers to improve model development. To support platform assembly, netlisters, graphical platform viewers and a compilation framework are also required. For platform functional validation, engineers will require monitors and recorders, transaction viewers, and more generally trace analysis tools. For performance verification, tools should support various timing scenarios and provide reports on the simulation figures. Today, the TLM activity is still in its infancy and is ramping up significantly. The only way to consolidate the activity and spread it across the industry is to provide standards that ensure model interoperability on the one hand, and tools that support all the required standards and improve the productivity of modeling engineers and platform integrators on the other.
CHAPTER 5 THE CONFIGURABLE PROCESSOR VIEW OF PLATFORM PROVISION
GRANT MARTIN Tensilica, Inc.
M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 59–70. © 2006 Springer.
In this chapter we discuss platform-based design, performance analysis and modeling of platforms for end-users from the standpoint of a provider of intellectual property (particularly, configurable and extensible processors) and of more complex aggregates of processors and other devices, which can be said to form a “configurable platform of configurable processors”. In particular, we address what platform providers must provide in the way of models, characterization parameters and other important design data; the sources of these models and data, and of ancillary design tools; the use model for the end user of the platform; and the current state of the art and likely evolution of these trends. Being in the position of a provider of configurable processors brings a unique perspective to this analysis.
1. CONFIGURABLE, EXTENSIBLE PROCESSORS AS INTELLECTUAL PROPERTY
One of the key trends in recent years in intellectual-property based design is the growing use of application-specific instruction set processors (ASIPs) for more and more of the basic design architecture, and the increasing differentiation of products through software [1]. ASIPs provide the programmability of general-purpose processors while meeting the performance, size, and power requirements of embedded system-on-chip designs. Configurable, extensible processors, automatically generated from a structural and instruction extension specification, along with all required software tools, have been available as embeddable intellectual property for a number of years. They have been offered by several providers, and have been
successfully used in many embedded applications. Their use in designs, both as vehicles for general software applications, and as targeted replacements for HW blocks designed at the RTL level, challenges our design flows and methodologies. When designers wish to make use of multiple embedded processors, configured in a variety of ways to best meet specific processing requirements, the challenges grow quickly. Here we use the Tensilica processors as examples to illustrate the challenges of platform provisioning so that designers are able to optimise their use of these IP blocks either singly or in multi-processor configurations. The most recent generation of Tensilica configurable processor architectures is the Xtensa LX architecture introduced in the summer of 2004 [2, 3]. The Xtensa LX architecture has a number of configuration options. Instruction extensions can be described in an Instruction Set Architecture (ISA) description language, the Tensilica Instruction Extension (TIE) language. These configuration/extension options may be thought of as “coarse-grained” (structural) and “fine-grained” (instruction extensions via TIE). Configuration options include:
• Single or multi-issue architecture where multiple execution units may be active simultaneously, configured by the user.
• Flexible-length instruction-set extensions (FLIX) for efficient code size and an ability to intermix instruction widths and create multi-operation instructions for multi-issue.
• An optional second load/store unit to increase the classical ISA bandwidth (for example, in DSP applications).
• Either 5- or 7-stage pipeline, the latter to improve the match between processor performance and on-chip memory speed.
• Size of the register file.
• Optional inclusion of specialized functional units, including 16- and 32-bit multipliers, multiplier-accumulator (MAC), single-precision floating-point unit, and DSP unit.
• Configurable on-chip debug, trace, and JTAG ports.
• A variety of local memory interfaces including instruction and data cache size and associativity, local instruction and data RAM and ROM, general-purpose local memory interface (XLMI), and memory-management capabilities including protection and address translation. These interfaces may be configured for size and address-space mapping.
• Timers, interrupts, and exception vectors.
• Processor bus interface (PIF), including width, protocol, and address decoding, for linking to classical on-chip buses.
• TIE instruction extensions include special-purpose functions, dedicated registers, and wire interfaces mapped into instructions.
• Options for communications directly into and out of the processor’s execution units via TIE ports and queues. These allow FIFO communications channels (either unbuffered, or n-deep), memories, and peripheral devices, to be directly hooked into instruction extensions defined in the TIE language. As a result,
multi-processor systems using direct FIFO-based communications between execution units in the processor datapath are possible. Such a variety of configuration and extension opportunities for embedded processors makes it easier for architects and designers to consider the design of complex multiprocessor platforms. These may include both symmetric and asymmetric multiprocessing approaches, although the chance to tightly optimise a processor for a specific dedicated task promotes thinking about asymmetric approaches in which heterogeneous processors are interconnected by a customised interconnect network and hierarchy of memories for a specific embedded application. Since the base processor is relatively small (under 20,000 gates), and since instruction extensions together with dedicated HW FIFO interconnect allow very high performance for a configured processor, it now becomes feasible to consider the replacement of many HW RTL blocks with dedicated processors, for many functions. This of course lowers design risk, reduces verification requirements and allows in-field flexibility when compared to RTL-based approaches. To support this design approach, a highly automated processor generation process is essential. This is provided in Tensilica’s case through a web-based configuration service accessible to designers through client-based software. The automated process allows designers to enter configuration options via a series of entry panels, where the feasibility of the set of configuration parameters is cross-checked during entry and the designer is given warnings or errors for erroneous combinations of parameters. Instruction extensions are created in the TIE language using an editor that is part of the Tensilica Integrated Development Environment (IDE), Xtensa Xplorer, which is based on Eclipse (www.eclipse.org).
The TIE compiler checks the designer-defined instruction-extension code and generates a local copy of the relevant software tools (compiler, instruction-set simulator, etc.), which can be used for rapid software development and experimentation long before chip fabrication. At some point, the designer will want to generate the complete set of software tools and the HDL hardware description for the processor. The generation process allows the configuration options and the designer-specified TIE files to be uploaded to Tensilica’s secure server environment. Within one to two hours, the complete ASIP will be generated and verified. The software tools and RTL for the ASIP can then be downloaded to a local workstation for further work on software development, instruction extension, and performance analysis and optimization. The verification approaches used to validate Tensilica’s processor generation process and the resulting design outputs are described in more detail in [4]. A more highly automated specification creation approach called XPRES was introduced in 2004 [5]. XPRES takes as input application C and C++ code and, through automated analysis and design space exploration, examines many possible Xtensa LX configurations as candidate processors for these applications. These are presented to the user in the form of a Pareto trade-off surface, and the user can choose the combination of performance, area and power consumption that best meets their application requirements. Alternatively XPRES may be used to produce
a starting point for further manual configuration and instruction extension – one that may be optimised by designers by further exploring the ranges of configuration and instruction extension supported by Xtensa LX. For a number of years, Tensilica has supported basic instruction set simulator (ISS) and multi-processor ISS-based system modeling using a multi-threaded modeling environment called XTMP. XTMP allows designers to build single or multi-processor models with multiple instantiations of as many different Xtensa LX processor configurations as they wish. These can then be hooked together using defined TIE ports and queues, and by declaring memories, both system memories (using the PIF or processor interface), or local memories both private and shared. The system memory maps can be built up using a notion of connectors that represent the connection and address mapping of a shared bus, but is not an explicit bus model. In addition, user defined components can be modeled using XTMP modeling constructs and interfaced to memory ports, connectors or attached to queues or the XLMI local memory interface. XTMP is based on a C and C++ compatible Application Programmers Interface (API). XTMP uses two different threading libraries – one using quickthreads, or one using SystemC threads (which of course uses quickthreads to support its concurrency and schedule model). When an XTMP model is built using SystemC, SystemC threads and the appropriate driver, it is a full-fledged SystemC model, albeit one that imposes its own higher level timing semantics in order to allow designers to build correct system models where the cores can fire at appropriate times, where memories and user devices can respond to requests at appropriate times in the system modeling cycle, and where debugging, tracing and other important functions can operate correctly. 
This has allowed XTMP models using SystemC threading to be interfaced to, or wrapped for use with, other SystemC models and commercial SystemC toolsets as well as the OSCI reference SystemC simulator. Tensilica does not provide a standard proprietary processor bus of its own. Rather, it provides a tuned processor interface, PIF, which can be adapted to interoperate with a variety of commercial, proprietary and internal buses. PIF defines a split-transaction bus with a basic set of transactions appropriate to the Xtensa LX processor: reads, writes, block reads, block writes, and read-conditional-write. PIF is configurable to support 32, 64 and 128 bit interfaces. By defining appropriate bus bridges, PIF has been interfaced to standard ARM buses (AMBA), OCP, and many internal proprietary bus models. As PIF is an advanced split-transaction bus, interfacing it to newer buses such as ARM AXI is quite tractable. As well as the physical design interfaces to such buses, wrappers in SystemC or C/C++ may be built to interface to XTMP models so that bus interfaces may be modeled in this simulation environment.
2. MULTI-PROCESSOR SoC
As discussed previously, the recent shifts in design architectures and the design methods used for complex SoC have seen the rise of multi-processor SoC and its application across a huge spectrum of the design space. This has spawned
workshops, books [6] and a host of commercial activity in “multi-core” designs and multi-processor based SoC. On a practical basis, Tensilica customers, for example, have been increasing their use of heterogeneous multi-processor based designs. A survey showed an average use of six processor cores among the many designs that used the Tensilica core. Often configurable and extensible cores are used for the data-intensive portions of a system where configuration and extended instructions have a large impact, while standard embedded RISC cores are used for the control parts of the system. Configurable cores are also used in more homogeneous systems. For example, the Cisco Systems CRS-1 router uses multiple chips, each with 188 Tensilica processors for packet processing (plus 4 additional ones for yield management). In this case, the application of high end routing allows effective use of multiple processors, each of which is used to process a single packet through to completion. We have also recently seen tremendously growing interest in “multi-core” systems, where semiconductor vendors such as IBM, AMD and Intel have moved from offering single core processors with enhanced performance, to offering increased performance via 2, 4 or more processor cores in a single device. This is due mostly to the impossibility of offering increased performance via the traditional routes of increased speed, deeper pipelines and more parallelism in the basic microarchitecture. Increasing speed has hit a technology brick wall and the disproportionate increase in heat and energy dissipation has become a technology dead end. Some processors are now using 28 or more pipeline stages. And using additional parallel hardware through techniques such as speculative execution and predication has also tended to increase power consumption and heat at a faster rate than the increase in performance.
The multi-core types of designs are often packaged in Symmetric MultiProcessors (SMP) – where the cores share a coherent view of the memory space via a number of hardware-based cache and memory coherency mechanisms. One of the more notable chips in this class is the SUN Niagara processor with up to 8 cores. This suits general purpose processing, as in high end desktop machines and servers, where applications can be divided into a few concurrent threads, following a SMT (Simultaneous MultiThreading) programming model; where threads can be reassigned to processors dynamically; where the tasks are general purpose, not realtime embedded, and not assignable to processors ahead of time. In this general-purpose processing model, it is best if processors are homogenous, support HW-based context switching for quick thread switching when they stall on memory access, and support a general large coherent memory space. The SUN Niagara, for example, allows 4 threads per core for up to 32 threads per multi-core processor. The extreme examples of SMP machines are large scientific processors (weather and earth simulation, nuclear and large scale physics research), where the applications may change much more quickly than the machines and very high performance memory subsystems using advanced speed interconnect allow processors to share a coherent memory space.
In the embedded application space, however, a different approach is often optimal. This is the asymmetric multi-processing (AMP) approach. In AMP, tasks are often known ahead of time either in detail or in enough detail so that processors can be optimised for specific tasks that can be assigned to them. These processors can exploit configuration and instruction extension to allow enhanced performance and reduced power consumption. By differentiating the processors and using a variety of interconnection schemes (not just bus-based coherent system memory), the subsystem for an embedded application can be optimised in all its aspects. Here, dataflow programming models (as opposed to SMT) are a better match for the heterogeneous multi-processors that are most appropriate. The classic example of heterogeneous multi-processing for embedded applications is the cellphone, involving RISC processors for control and user interface, DSPs for voice encoding and decoding, and a variety of specialised media processors (audio, video) for functions such as cameras, MP3 playing and video broadcast. As people shift to more and more heterogeneous MP platforms, the design tools and methods, and models required for their effective design, shift to a processor-centric design approach. The large increase in software focuses attention on instruction set simulators and speed-accuracy tradeoffs in ISSs. Multiple use cases need to be considered. First and foremost, designers need help in partitioning applications onto multiple processors. Whereas single processor applications were relatively easy to design in the past, and partitioning applications into SW on one processor and some accelerating HW could be modeled using ad-hoc methods, MP, especially with configurable processors, causes a huge explosion in the design space. Designers must explore the number and types of processors and their configuration, the mapping of tasks to processors, and their communication and synchronization needs.
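The size of this design space can be illustrated with the task-mapping problem alone: with T tasks and P candidate processors there are P^T possible assignments, before processor configuration or communication is even considered. The sketch below enumerates all assignments exhaustively for a small case and keeps the one with the smallest makespan (the load of the busiest processor). It rests on strong simplifying assumptions that are not from the text: tasks are independent, communication and synchronization costs are ignored, and the cost table is supplied by the caller. The names MappingResult and best_mapping are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct MappingResult {
    std::vector<std::size_t> task_to_proc;  // chosen processor index for each task
    double makespan;                        // load of the busiest processor
    std::size_t mappings_explored;          // number of assignments evaluated (P^T)
};

// cost[t][p] is the (assumed known) execution time of task t on processor p.
MappingResult best_mapping(const std::vector<std::vector<double>>& cost) {
    assert(!cost.empty() && !cost[0].empty());
    const std::size_t tasks = cost.size();
    const std::size_t procs = cost[0].size();
    for (const auto& row : cost) assert(row.size() == procs);

    std::vector<std::size_t> assignment(tasks, 0);
    MappingResult best{assignment, std::numeric_limits<double>::infinity(), 0};

    while (true) {
        // Evaluate the current assignment: sum the load placed on each processor.
        std::vector<double> load(procs, 0.0);
        for (std::size_t t = 0; t < tasks; ++t) load[assignment[t]] += cost[t][assignment[t]];
        const double makespan = *std::max_element(load.begin(), load.end());
        ++best.mappings_explored;
        if (makespan < best.makespan) {
            best.makespan = makespan;
            best.task_to_proc = assignment;
        }

        // Advance to the next assignment like an odometer in base `procs`.
        std::size_t t = 0;
        while (t < tasks) {
            if (++assignment[t] < procs) break;
            assignment[t] = 0;
            ++t;
        }
        if (t == tasks) break;  // every assignment has been visited
    }
    return best;
}
```

Even this reduced problem grows exponentially: 10 tasks on 4 processors already give 4^10 = 1,048,576 assignments, which is why tool support for partitioning is needed once processor configurations and communication are added to the search.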
3. BASIC IP AND PLATFORM MODELS
The fundamental IP model for processor-centric platforms is the cycle-accurate ISS model. Both in isolation and in combination for MP platforms, this gives the basis for designers to make fundamental judgements on the number and type of processors required for a system and the mapping of SW tasks to those processors. Cycle-accurate ISS models may have speeds of several hundred thousand simulated processor cycles per second, up to the low millions, depending on the nature of the processor being simulated. Although it is possible to build instruction-accurate ISS models, these are a secondary kind of model and cannot substitute for a cycle-accurate model. In order to be cycle-accurate – that is, to accurately reflect the processor state at the end of each simulated cycle – ISS models need to reflect pipeline accuracy. This requires simulating much more of the internal processor microarchitecture than an instruction-accurate model. Nevertheless, the use of C/C++ datatypes, and abstractions for internal micro-architectural elements,
as well as high-level programming languages, allows cycle-accurate ISSs to be anywhere from 100 to 1000+ times faster than an equivalent RTL model. The cycle-accurate model is the basis for accurate platform and system models, and for accurate profiling, instruction tracing, transaction tracing and verification of the performance and functional aspects of a system. As we have seen, there are other choices to be made on the speed-accuracy tradeoff axis, including instruction-accurate models. Of growing importance for software verification and development are fast functional models of processors that can be put into a fast functional model of a system. This abstracts away much of the device and synchronization detail of the overall system but can offer another 1000x in overall system simulation performance while guaranteeing correct functional execution of code on each processor model. Together with processor models, good tradeoff options at the system level for speed and accuracy can be created using the relatively new concepts of “transaction-level modeling” [7]. Transactions can provide data abstractions for more efficient simulation using built-in language datatypes, and also replace less efficient simulation mechanisms such as event queue processing or cycle-based model firing with direct method calls for models when required. Transaction-level approaches are used in almost all fast simulation methods and have begun to be used by a number of design groups in different companies. Different types of transaction-level models, as discussed in [7] and other references, allow different speed-accuracy tradeoffs and support a variety of use models. One of the most important trends in recent years to promote a consistent modeling approach and support the delivery of transaction-level, interoperable IP models from a number of suppliers is the use of SystemC as the modeling notation.
Indeed, the growing use of SystemC in Electronic System Level (ESL) design is a major factor in allowing users to construct the right kind of platform models. Thus ISS models must either be interfaced directly in SystemC, or come with SystemC wrappers, in order to satisfy this modeling need. Designers also need other kinds of models as well: components such as memories, DMA’s, and peripherals of all kinds are important. Perhaps most important next to the processor ISS models are standard and interoperable models of buses – ARM (AMBA AHB, APB, AXI), OCP, IBM (CoreConnect) and proprietary in-house buses used by design teams. By providing interoperable models in SystemC, SystemC becomes the “integration framework” or “platform” for pulling them all together. However, models are not enough. Unless two transaction-level models are built to speak the same kind of “transaction language”, it is guaranteed that these models will not interoperate. There have been some attempts by the Open SystemC Initiative (OSCI) to define a basic Transaction-level modeling (TLM) standard, the first version of which was released in 2005. However, the TLM 1 standard described basic transport of simple read and write transactions only, and does not address the true needs of ISS and complex model interoperability. Although there are rumours of further work in OSCI on a “TLM 2” standard, that would address the lacks in the TLM 1 approach, including standards for debugger integration, simulation
control and a richer variety of basic transaction types, it is not clear when and if such a standard will emerge.
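To show what a shared “transaction language” limited to simple reads and writes looks like, and how a transaction becomes a direct method call rather than signal-level activity, here is a minimal sketch in plain C++. The interface shape, a single blocking transport call taking a request and returning a response, loosely follows the tlm_transport_if<REQ, RSP> interface of OSCI TLM 1.0, but it is written without SystemC so that it stands alone. All names below (Command, Request, Response, TransportIf, Memory, write_word, read_word) are hypothetical and are not the standard’s own classes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

enum class Command { Read, Write };

// Request and response carry the transaction payload.
struct Request {
    Command command;
    std::uint32_t address;
    std::uint32_t data;  // used by writes only
};

struct Response {
    bool ok;
    std::uint32_t data;  // meaningful for reads only
};

// Stand-in for a bidirectional blocking transport interface.
template <typename REQ, typename RSP>
struct TransportIf {
    virtual ~TransportIf() = default;
    virtual RSP transport(const REQ& request) = 0;
};

// A target: a sparse word memory in which unwritten locations read as zero.
class Memory : public TransportIf<Request, Response> {
public:
    Response transport(const Request& request) override {
        Response response = {true, 0};
        if (request.command == Command::Write) {
            storage_[request.address] = request.data;
        } else {
            const auto it = storage_.find(request.address);
            if (it != storage_.end()) response.data = it->second;
        }
        return response;
    }

private:
    std::map<std::uint32_t, std::uint32_t> storage_;
};

// Initiator-side convenience functions layered on top of transport().
inline void write_word(TransportIf<Request, Response>& target,
                       std::uint32_t address, std::uint32_t data) {
    const Response response = target.transport({Command::Write, address, data});
    assert(response.ok);
}

inline std::uint32_t read_word(TransportIf<Request, Response>& target, std::uint32_t address) {
    return target.transport({Command::Read, address, 0}).data;
}
```

Two models interoperate here only because they agree on Request and Response. Bursts, conditional transactions, debug access and simulation control have no place in this payload, which is the kind of gap the text attributes to the first TLM standard.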
4. LACKS: WHAT’S MISSING
The basic needs for an interoperable TLM modeling standard include:
• Processor firing semantics: “who-whom” – which models get fired and when; who can initiate transactions (masters), who can respond to them (slaves), and who can operate as both master and slave (peers); when processors fire and when other user models fire.
• A rich and realistic set of transaction types, including burst or block reads and writes, and conditional transactions, including an ability to configure transaction sizes and datatypes.
• Debugging, performance monitoring and tracing APIs, so that models can be linked together in standard debuggers and issue transaction traces for post-simulation performance analysis.
The problem with the lack of an interoperable TLM standard supported by a standards body such as OSCI is that various tool vendors will try to push their proprietary solutions as ‘standards’, but none of them will achieve enough de facto usage and support to become a real standard. For example, ARM put forward its RealView ESL API as a possible standard in November 2005. No doubt other ESL tool vendors such as CoWare, Mentor, Synopsys and Cadence will all want to push their own standards as the one true standard. In the absence of an interoperable standard, the usual anarchy will prevail, and IP vendors will end up having to support many TLM modeling tools and pseudo-“standards”, depending on user demand and on which tools prove to be most popular.
The other key lacks are in commercial ESL tool offerings. ESL tools can be classified into four categories:
• Algorithmic design, analysis and implementation.
• Behavioural synthesis.
• Virtual system prototyping.
• SoC construction, simulation and analysis.
The two that are most relevant to IP-based platform modeling and design are the virtual system prototyping and SoC construction, simulation and analysis categories. There are several commercial ESL tools that fall into the SoC “constructor-simulator” class.
These often provide a graphical “platform constructor” approach based on a library of IP elements – processors, buses, memories, peripherals – which can be interconnected and configured in relatively small ways, to form a platform that can be simulated and generate transaction analysis traces for performance analysis. Such tools may come with reasonably extensive bus and model libraries. However, they do not easily support design space exploration at the early stages of processor-centric platform definition. They are most suitable to verify
microarchitectures that are already determined. They don’t help with software partitioning or processor optimization and configuration. They usually support classical, bus-based communications architectures only and don’t support new interconnect methods. Often they are easiest to use in single-processor rather than multi-processor designs, and the software arrives as a “Deus ex machina”, exogenously, rather than being developed and modified within the design environment. Programming model support for software partitioning is often non-existent. They can be tedious to use, and there are too many such tools for a limited design environment, portending a shakeout to come and potentially the stranding of designs, IP and users in unsupported tools.
The final key lack is in the area of standardized fast modeling approaches. The use model (SW development and verification) for fast functional ISS simulators, and resulting fast functional platform models, has already been discussed, and it is fundamentally different from the use model for detailed cycle-accurate models. The latter is more important for detailed performance analysis and HW-SW verification at the micro-architectural level. But as long as cycle-accurate models are available to support this last level of abstraction, fast functional simulation, offering 10x to 1000x performance improvement, is extremely useful. Experience with processor ISSs shows that at least 100x speedup is possible. But using them to build a fast functional system model is still very ad hoc, and there are no standard interfaces or APIs to make this process easier, especially to make the synchronization of various device models compatible with fast functional simulation. There are some commercial providers of such models, but the fast functional models need to be hand-crafted for each processor, which does not help the provider of configurable and extensible processors.
As is the case with other models, the responsibility for provision devolves back to the IP provider.
5. KEY DECISIONS AND REQUIREMENTS FOR MPSoC DESIGN
When we design an MPSoC-based system, using a processor-centric design approach, we must answer several key questions:
• How many processors are enough?
• How do you configure/extend them for an application?
• Are they homogeneous or heterogeneous?
• How do they communicate?
• What is the right concurrency model?
  – Pipelined Dataflow?
  – Multithreading?
  – Both?
• How do you extract concurrency from applications?
• How do you explore the design space?
• (Eventually, how do you move from 10 to 100 to 1000 or more processors?)
Answering these questions requires sophisticated design space exploration capabilities centred on the right IP models and platform models. Such design space exploration requires the following capabilities:
• Integrated Development Environment (IDE), allowing one to:
  – Define applications.
  – Configure processors and extend them.
  – Profile single applications on single processors.
  – Develop MP structure.
  – Map tasks to processors.
  – Generate simulations, launch, trace, analyse, iterate.
  – Link, pack, and load multiprocessor executables via test or memory infrastructure.
• Standards for MP structural definition.
  – E.g. XML as in SPIRIT.
• Standards for IP model interoperability.
  – SystemC, TLM, beyond OSCI TLM.
• Both cycle-accurate and fast simulators.
  – System analysis and verification and SW validation.
• Abstract programming model(s).
  – Pipelined dataflow.
  – Multi-threaded.
  – More easily re-map tasks to processors.
This kind of integrated development environment and set of capabilities may be best provided by IP developers, as there is a lot of specific IP content in it.
6. PROVISION OF DESIGN MODELS AND TOOLS, AND USE MODELS
Software, design tools, models and design methods are an indispensable part of any SoC platform, complementing the hardware models, methods and tools. Since processor-centric platforms emphasise both processors and software, and since configurable and extensible processor IP requires advanced tools and methods to make it possible to configure them and design with them, it is unlikely that generic commercial ESL tool providers will develop sufficient capabilities to support this design approach. Indeed, without sophisticated tools and methods, it is very unlikely that designers and architects would look at a configurable processor-based design approach. System level or ESL design tools are essential to support MPSoC design, and we have argued that the generic industry may not be able to provide them. It is clear that the IP providers must step up to this task.
There are at least two major design flows that can be envisaged for processor-centric platform-based design. One is a top-down, application and algorithm-driven design flow that starts with applications and ends up with an optimised MPSoC platform to support them. A complementary approach is a platform-creation,
“middle-out” approach where designers will specify a relatively generic MP platform with some small amount of processor configuration to fit a design domain (e.g. audio processing or video applications), will characterize it, possibly using generic “traffic generation” software code and monitors, and will supply it with associated tools and flows to platform users, who customize it for applications with software programming. Other design flows are possible, but these two basic approaches cover many opportunities.
7. STATE OF THE ART AND FUTURE ROADMAP
Platform and IP providers today already deliver a wide variety of IP models and design tools. This is especially true of the configurable and extensible processor providers. However, the cost of such development is high, and is not made easier to bear by the absence of decent standards for model and tool interoperability. Here the EDA industry, which controls most EDA-related standards, is letting the design community and IP industry down.
Part of the problem for the EDA industry is that ESL may not represent a real commercially viable alternative for most of its subdivisions. While behavioural synthesis may come to be a commercial marketplace (substituting for a fair chunk of RTL synthesis, which it uses as a back end for design completion), and algorithmic design and analysis has been commercially viable for several tools and competitors for many years (Mathworks Matlab and Simulink, CoWare SPW, Synopsys COSSAP and CoCentric), the other design approaches may never attract enough users to make them viable. The number of system architects is relatively limited, and while this is a viable pool for the IP providers to service, it may not be attractive for the commercial industry.
As discussed, standards would really help in this respect and would also fuel more development of a commercial and generic tool industry, because models would be provided by IP providers in such a way that they would interoperate in several different commercial ESL tools without requiring significant adaptation. However, the current standards regime of OSCI for SystemC, together with IEEE P1666 for the language itself, may be insufficient to deliver the TLM 2 standards and APIs that are required. This is a key area for industry development.
8. CONCLUSIONS
From the perspective of processor IP provision, especially for configurable and extensible processors, the IP and platform providers must supply a rich, highly capable and automated design tool and model environment. This must encompass automated processor configuration, manual processor extension specification, automatic generation of hardware, verification models, ISSs, and the whole SW tool chain including compilers, debuggers and assemblers, all incorporated into a sophisticated integrated development environment. These must be complemented by MPSoC development tools for SW partitioning, mapping, simulation, tracing, and analysis.
Commercial EDA providers are failing to provide real front-end design tools to support MP specification and configuration. If anything, they attack design niches or back-end verification only. Therefore, despite the structural problems in the EDA industry and in standards, the IP industry must take responsibility for providing what users need for effective MPSoC and platform-based design. This may not be the ideal future, but it certainly seems the most likely one.
REFERENCES
[1] Matthias Gries and Kurt Keutzer (editors), Building ASIPs: The MESCAL Methodology, Springer, 2005.
[2] Chris Rowen and Steve Leibson, Engineering the Complex SOC, Prentice-Hall PTR, 2004.
[3] Steve Leibson and James Kim, “Configurable processors: a new era in chip design”, IEEE Computer, July 2005, pp. 51-59.
[4] Dhanendra Jani, Chris Benson, Ashish Dixit and Grant Martin, Chapter 18, “Functional Verification of Configurable Embedded Processors”, in The Functional Verification of Electronic Systems: An Overview from Various Points of View, edited by Brian Bailey, IEC Press, February 2005.
[5] David Goodwin and Darin Petkov, “Automatic Generation of Application Specific Processors”, CASES 2003, San Jose, CA, pp. 137-147.
[6] Ahmed Jerraya and Wayne Wolf (editors), Multiprocessor Systems-on-Chips, Elsevier Morgan Kaufmann, 2005.
[7] Frank Ghenassia (editor), Transaction-Level Modelling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, 2005.
CHAPTER 6
PERIPHERAL MODELING FOR PLATFORM DRIVEN ESL DESIGN
TIM KOGEL
CoWare Inc., [email protected]
Abstract: This article provides an overview of a SystemC-based Transaction Level Modeling (TLM) methodology for the rapid creation of SoC platform models. First a brief overview of the ESL design tasks and the corresponding modeling requirements is given. The main topic is a methodology for the efficient creation of transaction-level peripheral models. Those are usually specific for a particular SoC platform and have to be created by the ESL user.
1. PLATFORM DRIVEN ESL DESIGN
Electronic System Level (ESL) design refers to a set of System-on-Chip (SoC) design tasks, like embedded software development or architecture definition, which have to be addressed before the silicon or even the RTL implementation becomes available. Using ESL, these tasks are performed by means of a transaction-level model of the SoC platform, which delivers the required simulation speed, visibility, and flexibility. This greatly improves the productivity of running and debugging embedded software, investigating architectural alternatives, performing hardware/software integration, and validating the performance and efficiency of the resulting system.
However, every user has to create a transaction-level model of the SoC platform before reaping the benefits of ESL design. IP and ESL tool vendors can reduce this significant investment in modeling by providing model libraries, but this remedy is limited to typical commercial off-the-shelf IP blocks like processors and buses. The majority of the platform-specific IP blocks need to be modeled by the ESL user. This has proved to be very problematic for the following reasons:
• The current level of standardization is not sufficient to protect the investment in modeling.
71 M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 71–85. © 2006 Springer.
• Only limited resources are available for the creation of transaction-level models.
• Each of the ESL design tasks has different modeling requirements in terms of accuracy and simulation speed.
A standards-based methodology for the efficient creation of reusable transaction-level peripheral models is therefore one of the most important prerequisites for the urgently required adoption of ESL design.
1.1 TLM Use Cases and Abstraction Levels
The discussion about the right level of abstraction for transaction-level modeling is truly a difficult one. In general, a modeling approach can be divided into domains for communication, data, time, structure, and functionality. Each domain can be represented at different levels of abstraction: time can be untimed, timed, or cycle accurate, and data can be modeled as abstract data types, bursts of words, or words. To simplify this discussion, we first classify TLM according to use cases rather than abstraction levels. This puts the purpose of the model at the center of attention.
The Functional View (FV) is a TLM use case that represents an executable specification of the application that is to be executed on the SoC platform. The Functional View stands somewhat apart from the following three TLM use cases, because it is used for modeling the application. The Programmers, Architects, and Verification Views, on the other hand, are used to model the platform architecture.
The Architects View (AV) is a TLM use case targeted at architectural exploration. Compared to an ad-hoc specification of the SoC architecture, the early exploration and quantitative assessment of architectural alternatives has the potential to reduce the chip cost in terms of area and IP royalties. Additionally, the AV use case mitigates the risk of late changes in the architecture caused by missed performance requirements. A model deployed in this use case should have sufficient timing information to enable exploration of architectural choices and trade-off analysis. Usually the processor cores are abstracted to traffic generators and File Reader Bus Masters that mimic the on-chip communication load. This approach minimizes the initial modeling effort and yields an acceptable simulation speed.
The Programmers View (PV) is a TLM use case for embedded software design.
Based on a “virtual prototype” of the SoC platform, companies can get a significant head start in embedded software development. The full visibility of the PV platform model greatly improves the debugging productivity of the embedded software developer compared to a development board or to an emulator-based solution. Additionally, the model of the new platform can be shipped to customers for early assessment of features. Here functional correctness is important for the elements of the model which are visible to the software. Additionally, the memory map needs to be modeled
correctly. The availability of fast Instruction-Set Simulators is a key prerequisite for this use case.
The Verification View (VV) is a TLM use case for cycle-accurate system validation and HW-SW co-verification. The performance can be optimized by fine-tuning the configuration of the interconnect IP using a functionally complete and cycle-accurate transaction-level model of the SoC platform. This helps to reduce the chip cost, because the resources can be optimized to achieve the required performance. The expected performance of the final design can be fully qualified prior to the implementation phase. This further reduces the risk of late changes in the architecture to meet requirements. Additionally, the block-based verification of the RTL implementation against the golden TLM model in the realistic system context reduces the effort for the creation of testbenches and reference models at the RT level. This improves the productivity of the verification engineers. Platform models for the VV use case are composed of cycle-accurate Instruction-Set Simulators and bus models. This of course impacts the simulation speed, but the accuracy cannot be compromised.
Starting from this use case-based classification, it is much simpler to talk about the appropriate level of abstraction to achieve a particular design task. In general, it is not meaningful to position one use case at a “higher” or “lower” level of abstraction. Instead every use case has an optimal working point:
• AV requires a certain degree of timing information to capture the anticipated performance of the system. The functionality is usually not that important, so the application can be represented by a non-functional workload model.
• PV requires only very little timing but needs to be functionally complete.
1.2 Reuse-Driven Peripheral Modeling Methodology
In essence, CoWare’s Platform-Driven ESL Design paradigm combines the individual use cases into a consistent ESL design flow, where different design problems are solved using the appropriate TLM use case. The key enabler for this design paradigm is the reusability of design-specific peripheral models across multiple use cases. Otherwise the return on the investment in modeling would not be sufficient. Additionally, the peripheral models must fulfill the requirements of the individual use cases in terms of simulation speed and accuracy.
The reuse is achieved on the basis of two assumptions. The first is that the abstraction level of the SoC platform model is mostly influenced by the abstraction level of the deployed interconnect models as well as the abstraction level of the deployed Instruction-Set Simulator (ISS). More specifically, the accuracy and simulation speed inherent to the chosen models of the interconnect architectures and programmable architectures determine the aptitude of a TLM simulation model for a specific use case. For example:
• A platform model constructed from instruction-accurate ISSs and PV bus models is only usable for software development, as the platform does not contain any timing information.
• A platform model constructed from cycle-accurate ISSs and cycle-accurate bus models is only usable for verification purposes. For the software developer the simulation is far too slow. An architect would require higher flexibility to efficiently explore the design space.
• A platform model constructed with cycle-approximate bus models and without an ISS is only usable for architectural exploration.
The good news from this observation is that the accuracy of the peripheral models is not really essential for the use case of the platform. The notable exception is the accuracy of the memory subsystem (like caches, memories, and memory controllers). We just have to make sure the peripheral models are fast enough to enable software development.
The second assumption states that the encapsulation mechanisms provided by C++ in general, as well as SystemC and TLM in particular, enable a mutual separation of communication, behavior, and timing. In other words, we can decompose the problem of modeling a platform element into several orthogonal aspects. What is more, each of these aspects can be nicely supported by a set of well-defined interfaces and modeling objects. The good news from the second observation is that the communication interface and the timing model of a peripheral can be modified without too much impact on the behavior. As explained in more detail below, this enables a refinement strategy based on timing refinement and TLM transactors.
These two general observations lead to the modeling strategy depicted in Figure 1. The three different bus models in the middle represent the fact that the accuracy and the resulting simulation speed of the interconnect model determine the use case of the platform simulation. Usually these different abstraction levels are not deployed in a single platform simulation. Instead, the three different bus models are part of the platform model library, from which the actual simulation for the respective use case is constructed.
Most importantly, the peripheral models have only one representation in this platform library, i.e. they can be reused for the different use cases.
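The first assumption amounts to a small decision rule, which can be written down as a sketch in plain C++. The enumerations and the function use_case are hypothetical names introduced here for illustration; only the three combinations listed in the text are mapped, and every other combination is deliberately reported as unspecified rather than guessed.

```cpp
#include <cassert>
#include <string>

// Abstraction levels of the two model kinds that, per the text,
// determine the use case of the platform (hypothetical enumerations).
enum class BusModel { PV, CycleApproximate, CycleAccurate };
enum class IssModel { None, InstructionAccurate, CycleAccurate };

// Returns the use case the text associates with each combination.
// Combinations the text does not discuss are reported as "unspecified".
std::string use_case(BusModel bus, IssModel iss) {
    if (bus == BusModel::PV && iss == IssModel::InstructionAccurate)
        return "PV";  // software development, no timing information
    if (bus == BusModel::CycleAccurate && iss == IssModel::CycleAccurate)
        return "VV";  // verification, too slow for software development
    if (bus == BusModel::CycleApproximate && iss == IssModel::None)
        return "AV";  // architectural exploration
    return "unspecified";
}
```

Note that the peripheral models do not appear as a parameter at all, which is the point of the observation: their accuracy does not select the use case.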
Figure 1. Reuse of Peripherals for Multiple Use Cases (diagram: an initiator and a target peripheral, each composed of behavior, storage and synchronization, and timing, connected through bus transactors to the PV, AV, and VV bus models)
Figure 1 also shows the separation of the peripheral components into timing, behavior, and communication, which is discussed later in detail. The construction of different platform models from a consistent model library is the essence of CoWare’s Platform-Driven ESL Design paradigm.
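The reuse idea of this section can be sketched in plain C++ as one bus-independent peripheral placed behind two interchangeable transactors. This is only an illustration under stated assumptions: RegisterFile, UntimedTransactor and TimedTransactor are hypothetical names, the fixed per-access latency is an assumed stand-in for a real timing model, and none of this is CoWare's library API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Behavior plus storage: bus-independent and untimed (hypothetical peripheral).
class RegisterFile {
public:
    void write(std::uint32_t offset, std::uint32_t value) { regs_[offset] = value; }
    std::uint32_t read(std::uint32_t offset) const {
        auto it = regs_.find(offset);
        return it == regs_.end() ? 0u : it->second;  // unwritten registers read 0
    }
private:
    std::map<std::uint32_t, std::uint32_t> regs_;
};

// Untimed transactor, standing in for a connection to a PV bus model.
class UntimedTransactor {
public:
    explicit UntimedTransactor(RegisterFile& p) : periph_(p) {}
    void write(std::uint32_t addr, std::uint32_t v) { periph_.write(addr, v); }
    std::uint32_t read(std::uint32_t addr) { return periph_.read(addr); }
private:
    RegisterFile& periph_;
};

// Timed transactor, standing in for a connection to an AV or VV bus model.
// It adds an assumed fixed latency per access without touching the behavior.
class TimedTransactor {
public:
    TimedTransactor(RegisterFile& p, unsigned cycles_per_access)
        : periph_(p), latency_(cycles_per_access) {}
    void write(std::uint32_t addr, std::uint32_t v) {
        elapsed_ += latency_;
        periph_.write(addr, v);
    }
    std::uint32_t read(std::uint32_t addr) {
        elapsed_ += latency_;
        return periph_.read(addr);
    }
    unsigned long elapsed_cycles() const { return elapsed_; }
private:
    RegisterFile& periph_;
    unsigned latency_;
    unsigned long elapsed_ = 0;
};
```

The same RegisterFile instance serves both transactors unchanged; only the wrapper differs between the untimed and the timed platform.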
1.3 The Importance of Standards
Every user is reluctant to tie his IP and system models to a vendor-proprietary modeling style and tool environment. Only the standardization of model interfaces enables the interoperability of models from different origins. This lowers the barrier to investing in the creation of transaction-level models and therefore fosters the adoption of ESL design. The following list gives a brief overview of the relevant standards in the area of SystemC-based TLM.
The SystemC Language Reference Manual (LRM) standardizes the basic SystemC language itself. This is the minimum requirement to build SystemC simulators, but is far from sufficient to achieve interoperability among transaction-level models.
The SystemC Transaction Level Modeling (TLM) 1.0 standard defines the fundamental communication and synchronization constructs that can be used to create TLM interfaces. Usually a protocol layer is built on top of the basic TLM 1.0 interfaces, which provides a convenient API for a specific communication protocol or design task. The TLM 1.0 standard therefore does not enable the plug-and-play interoperability of TLM models, because different protocol APIs are used by the models. Still, the definition of the basic TLM semantics facilitates the creation of transactors between different protocol APIs. Currently, the OSCI TLM working group is standardizing a set of generic protocol APIs for the Programmers View and Architects View use cases to further improve the interoperability.
The OCP-IP SystemC channel library is an example of a protocol layer for the Open Core Protocol. It defines TLM interfaces at multiple levels of abstraction. The highest abstraction level represents a generic interface for architectural modeling.
The SystemC Verification (SCV) library provides a standard interface for randomization and transaction recording.
The modeling style and all objects in the modeling library described in this article are compliant with SystemC and leverage the other SystemC-based standards mentioned above. This means that their functionality can be built on top of SystemC without any changes to SystemC IEEE 1666.
1.4 A TLM Pattern for SoC Peripherals
The orthogonalization of concerns, i.e. the separation of different aspects of the design process, is generally considered to be the key ingredient to tackle the complexity of SoC design in a divide-and-conquer kind of approach. In this context, the separation of behavior and communication is a well-known concept, but proves to be difficult to implement. In order to improve the reusability of the
behavior, we have developed a TLM pattern that pushes the concept of orthogonalization further. First, we propose to decompose the communication aspect into a generic storage and synchronization layer and a protocol-specific transactor layer. The purpose of the storage and synchronization layer is to exhibit well-defined generic interfaces towards the behavior as well as towards the transactor. This effectively decouples the behavior from any particular bus interface. This way, a peripheral model can be hooked to a model of a different bus (or to a model of the same bus at a different abstraction level) by just replacing the bus transactor.
Additionally, we propose to separate the timing of the model from the actual behavior as much as possible. This way, the timing accuracy of a model can be increased by adding timing information. Augmenting purely functional models with timing information is essential in order to reuse them for the AV and VV use cases.
In summary, the TLM pattern separates peripheral models into the following orthogonal parts: the bus-specific transactor, a generic synchronization and storage layer, the actual behavior, and the timing information. Ultimately this fine-granular separation of the different modeling aspects leads to a scalable accuracy, which enables a seamless reuse of models across multiple levels of abstraction. In the following we will first present the TLM pattern for target and initiator peripherals and then talk about the modeling of timing.
2. MODELING TLM TARGETS
Targets are SoC platform elements like memories, timers, interrupt controllers, or passive hardware accelerators, which are situated on the receiving end of the communication architecture. The TLM pattern for target peripherals is depicted in Figure 2.
Figure 2. TLM Pattern for Target Peripherals (diagram of a TLM target: bus, bus interface, bus transactor, PV interfaces, storage and synchronization layer with storage and alias, transaction processing, behavior blocks behavior1 to behavior3, and timing)
The generic TLM component pattern comprises the following aspects:
• The communication part is modeled as a bus transactor that converts the specific communication protocol of the interconnect architecture into a generic communication TLM API. This generic API permits a simple and canonical processing of the transaction inside the component.
• The memory-mapped registers and memories of the peripheral correspond to the storage and synchronization layer.
• The behavior is modeled in terms of passive callback functions, which are activated when a particular region or field of the register interface is accessed. Of course, the behavior can also be registered directly to the generic interface without an intermediate register layer.
• The timing layer can be realized in the transactor or as part of the behavior. The latter is usually required to model data-dependent timing. Modeling timing is discussed in more detail later.
This modeling pattern fits any memory-mapped peripheral, as well as blocks whose functionality is triggered through a simple register write and read.
We have selected the Programmers View API [1] as the generic interface both between the bus transactor and the register interface and between the register interface and the behavior. The reason for this choice is twofold. First, the PV API can be seen as the simplest communication interface for behavioral modeling. Yet the PV transaction data structures contain a sufficient number of attributes to capture the relevant aspects of any communication in a platform. A notable exception is of course the timing information of the communication, which is only included at the level of the Architects View or below.
A PV-based peripheral modeling methodology greatly increases the modeling productivity, because the effort of creating a TL model for the PV use case is only about 1/7 to 1/10 of the effort of creating a synthesizable RT-level implementation model of the same component. Secondly, the PV API is used for the functional modeling of bus nodes. In the case of a PV platform model, we can omit the bus transactor and directly hook the register interface to the PV bus node. This gives an extra boost to the simulation speed, which is required for the Programmers View use model.
2.1 Target Modeling Objects
The target modeling pattern can be nicely supported with a set of modeling objects for the storage and synchronization layer. These memory objects provide the means to store data items. The data has to be readable and writable both from the bus transactor and from the behavior. Additionally, the memory objects are responsible for synchronizing behavior and communication. In the case of a target peripheral, this translates into the activation of behavior in response to communication events. As shown in the center of Figure 2, a memory object essentially acts as an array of data items. All kinds of accesses and arithmetical operators are defined, so the usage of the memory object is transparent for the behavior. From the communication
side, the location of the array in the bus address map is defined at construction time. Hence, the memory object can autonomously decode incoming transactions and store or fetch the data item at or from the right location. This also covers the automatic processing of burst transactions, where a range of data items is written or read in one go. These features are usually sufficient to model plain storage elements such as memories or passive register banks.
Obviously, the processing of more complex transactions cannot be done automatically, but requires the activation of user-defined behavior. In this case, the memory object is able to activate certain portions of the behavior. This kind of invocation of passive behavior is realized by means of simple callback functions (represented by the curved arrows). These are regular functions carrying the signature of a PV transport call, which model the behavior and which can be associated with a memory object. Whenever the memory object is accessed by a transaction, the registered callback is activated.
However, it turns out that a single callback per memory object is not sufficient for practical usage. If a memory object represents, for example, the control register of a peripheral, a specific behavior can be associated with an individual bit of this register. For this purpose, we have conceived the concept of an alias to a certain region or bitfield in the memory. An alias does not represent additional storage, but enables the fine-granular registration of callback functions to arbitrary regions of the memory object.
2.2 Target Modeling Example
The small example depicted in Figure 3 illustrates the value of the target modeling objects in terms of modeling efficiency.
Module declaration (left)

 1  class Target_PV : public sc_module
 2  {
 3  public :
 4    PVTarget_port p_PV;
 5  private :
 6    scml_memory m_RegBank;
 7
 8    scml_memory m_RWRegister;
 9    scml_memory m_TRegister;
10
11    ReadRegCB(accessSize, offset);
12    WriteRegCB(d, accessSize, offset);
13    RSP TransportRegCB( const REQ &request );
14  };

Module construction (right)

 1  Target_PV::Target_PV(sc_module_name & n ) :
 2    sc_module(n),
 3    p_PV("p_PV"),
 4    m_RegBank("RegBank", scml_memsize(4)),
 5    m_RWRegister("RWReg", m_RegBank,0,1),
 6    m_TRegister("TReg", m_RegBank,2,1)
 7  {
 8
 9    m_RegBank.initialize(0);
10
11    REGISTER_WRITE(m_RWRegister, WriteRegCB);
12    REGISTER_READ(m_RWRegister, ReadRegCB);
13
14    REGISTER_TRANSPORT(m_TRegister,
15                       TransportRegCB );
16
17    p_PV(m_RegBank);
18  }
Figure 3. Target Peripheral Example
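The idea of a memory object whose sub-regions carry their own callbacks can also be illustrated without SystemC. The following is a minimal sketch in plain C++, not the scml_memory implementation: the class RegisterBank, its methods add_write_alias, write and read, and the callback signature are hypothetical names chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a memory object: an array of words plus
// optional write callbacks bound to sub-ranges (the "aliases").
class RegisterBank {
public:
    using WriteCallback =
        std::function<void(std::size_t offset, std::uint32_t value)>;

    explicit RegisterBank(std::size_t words) : storage_(words, 0) {}

    // An alias adds no storage; it only binds a callback to the word
    // range [first, first + count) of the existing storage.
    void add_write_alias(std::size_t first, std::size_t count,
                         WriteCallback cb) {
        assert(first + count <= storage_.size());
        aliases_.push_back({first, count, std::move(cb)});
    }

    // Bus side: store the word, then activate every matching callback.
    void write(std::size_t index, std::uint32_t value) {
        storage_.at(index) = value;
        for (const Alias& a : aliases_) {
            if (index >= a.first && index < a.first + a.count) {
                a.callback(index - a.first, value);
            }
        }
    }

    // Plain storage access, usable from the bus side and the behavior.
    std::uint32_t read(std::size_t index) const { return storage_.at(index); }

private:
    struct Alias {
        std::size_t first;
        std::size_t count;
        WriteCallback callback;
    };
    std::vector<std::uint32_t> storage_;
    std::vector<Alias> aliases_;
};
```

A write to a word without an alias only updates the storage, whereas a write to an aliased word additionally activates the registered behavior. This mirrors the distinction between passive register banks and registers with side effects made above.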
The module declaration in the left part of Figure 3 shows the declaration of the memory objects (lines 6–9) and the callback functions (lines 11–13). The right part shows the construction of the register file memory map (lines 4–6) and the registration of the callback functions to the alias registers (lines 11–15). Note that m_RWRegister and m_TRegister are alias registers, because they are constructed with m_RegBank as a parent object. The implementation of the callback functions representing the behavior is not shown. The behavior boils down to the actual functionality, since most of the modeling overhead in terms of address decoding, endianness conversion, error handling, and debug messages is implemented in the scml_memory objects. In our experience, the usage of memory objects significantly reduces the code size and improves the modeling productivity by about 30% compared to a plain PV modeling style.

3. MODELING TLM INITIATORS
Initiators are SoC platform elements such as dedicated processing elements, DMA controllers, or traffic generators, which actively insert transactions into the communication architecture. Programmable cores are modeled as Instruction Set Simulators (ISSs) and are not considered in this article. Figure 4 shows the proposed TLM modeling pattern for non-programmable initiators. The four parts of the initiator TLM pattern are defined as follows:
• The user-defined behavior is modeled in terms of autonomous SystemC processes, which actively initiate transactions.
• The storage and synchronization layer is composed of a post port and an initiator storage element called scml_array. The posting of transactions is non-blocking and merely specifies the necessary transaction attributes. The real synchronization of the behavior is based on the availability of data and space in the scml_array objects.
• The communication part is modeled in the bus transactor. Here posted transactions are queued and converted into actual bus transactions according to the respective bus protocol. The bus transactor operates on the same storage elements as the behavior to avoid needless copying of data.
• The timing layer can be realized by means of clock modeling objects or any mechanism for implicit and explicit timing annotation provided by the TLM communication API.

[Figure 4 diagram; labels: TLM Initiator, behavior, post interface, data interface, storage and synchronization (transaction queue, scml_array), bus transactor (transaction processing), timing, bus interface, bus.]
Figure 4. TLM Pattern for Initiator Peripherals

The interaction between the behavior on the one hand and the synchronization and storage layer on the other hand is not obvious. Unfortunately, the initiator side of the PV API is too simplistic to enable reusability at different abstraction levels. The blocking semantics of the PV transport call rule out refining the communication to a more realistic bus protocol. In order to enable reuse of initiator peripherals we have developed a generic TLM initiator API, which combines the non-blocking posting of transactions with the synchronization on data and space availability in the scml_array objects. The transaction data structure is a subtype of the regular PV data structure with a few additional attributes.
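Synchronization on data and space availability can be sketched in plain C++ as follows. This is not the scml_array API: the class SharedArray, its try_claim and release methods, and the helper run_exchange are hypothetical. As a simplifying assumption, a claim that would block in SystemC is modeled here as a try_claim call that returns false, so that a single loop can play the role of the cooperative scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared buffer: the whole array is either free "space"
// (owned by the producer) or valid "data" (owned by the consumer).
class SharedArray {
public:
    explicit SharedArray(std::size_t n) : items_(n, 0) {}

    // A claim that fails stands for a process that would block.
    bool try_claim_space() const { return !has_data_; }
    bool try_claim_data() const { return has_data_; }

    // Releases never block; they only hand ownership to the other side.
    void release_data()  { assert(!has_data_); has_data_ = true; }
    void release_space() { assert(has_data_);  has_data_ = false; }

    int& operator[](std::size_t i) { return items_.at(i); }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<int> items_;
    bool has_data_ = false;
};

// Alternates a producer step and a consumer step for `rounds` hand-overs
// and returns the sum of all items the consumer has read.
long run_exchange(int rounds) {
    SharedArray dm(4);
    long sum = 0;
    int produced = 0;
    int consumed = 0;
    while (consumed < rounds) {
        if (produced < rounds && dm.try_claim_space()) {
            for (std::size_t i = 0; i < dm.size(); ++i) dm[i] = produced + 1;
            dm.release_data();   // the consumer may now claim the data
            ++produced;
        }
        if (dm.try_claim_data()) {
            for (std::size_t i = 0; i < dm.size(); ++i) sum += dm[i];
            dm.release_space();  // the producer may now claim the space
            ++consumed;
        }
    }
    return sum;
}
```

Both sides operate on the same storage, so no item is copied between producer and consumer. Only the ownership flag changes hands.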
3.1 Initiator Modeling Objects
In analogy to scml_memory on the target side, scml_array is the key modeling element on the initiator side. In general terms, the major purpose of this object is to synchronize the data exchange between a producer and a consumer. Hence, scml_array can be seen as a data manager for data items which are stored in an array. Similar concepts can be found in the Task Transaction Level (TTL) interface defined by Philips [2]. The hardware analogy of scml_array would be a data buffer, where the incoming and outgoing data is stored.

This section briefly introduces the concept of synchronization on data and space availability using scml_array. As shown in Figure 5, a producer first needs to claim the required space in the array before it can write the data. This is a blocking operation, which does not return before the space is available. On the other side, a consumer has to claim the data before it can read the data items. This claim_data operation is also blocking and only returns when the producer releases the region in the scml_array (see the upper dotted arrow in Figure 5). Now the consumer owns the valid data, so a subsequent claim_space on the producer side only returns when the consumer releases the array region (see the lower dotted arrow in Figure 5).

producer:
    ...
    dm.claim_space(0,4);
    for (int i=0;i<4;i++) {
        dm[i] = ...;
    }
    dm.nb_release_data(0,4);
    ...
    dm.claim_space(0,4);

consumer:
    ...
    dm.claim_data(0,4);
    for (int i=0;i<4;i++) {
        ... = dm[i];
    }
    dm.nb_release_space(0,4);
    ...

Figure 5. Synchronization of Data Access Using scml_array

By means of scml_array, the producer and the consumer can communicate through a shared memory region without additional copy operations. During the period between the claim_space and release_data operations the producer module can use the same memory just like any regular array container for the actual processing.

In the context of modeling TLM initiators, scml_array separates the user-defined behavior from the bus transactor. During a write operation, the behavior acts as producer and the bus transactor represents the consumer. In the case of a read transaction, on the other hand, the bus transactor produces the data requested by the behavior. Note that the actual data is shared between producer and consumer to improve the simulation speed. Note also that the claim operations never actually block when an scml_array-based initiator is attached to a pure PV system, because all communication happens without delay. Hence in a pure PV system no SystemC event notification occurs, which is again important to maintain high simulation speed. Regrettably, the discussion of an example for the initiator TLM pattern is beyond the space constraints of this article.

4. MODELING TIMING
The entry point for the creation of SoC peripheral models is the Programmer's View API. These purely functional models need to be augmented with timing information in order to reuse them for the AV and VV use cases. Hence the upgrading of functional models with timing information is a cornerstone of the overall Platform-Driven ESL Design modeling methodology. In general we distinguish three different ways of adding timing to a functional peripheral:
1. Explicit Timing using SystemC objects like clocks and events.
2. Static Timing Annotation using annotation in the transactors.
3. Dynamic Timing Annotation using annotation in the behavior.
Techniques 2 and 3 are definitely the preferred ones, because they do not compromise the simulation speed. This way, the timing-annotated models are still applicable for the Programmer's View use case. The details of the timing annotation techniques for the PV and AV use cases are already discussed in [3], so we will restrict this article to an overview of the major concepts.
4.1 Timing Annotation Principles
In general, timing annotation refers to the specification of delays as part of the TLM API instead of using wait or other forms of delayed event notification in SystemC. The benefit of this approach is that the realization of the timing is deferred from the component to the interconnect models. These interconnect models can then decide to ignore the annotation (in the PV use case) or to translate the annotated timing into delays (AV use case) or cycles (VV use case). The difference between explicit and annotated timing is illustrated in Figure 6. In both cases the processing latency in the target peripheral is computed dynamically, depending on the actual data and the internal state of the peripheral.
• In the explicit case (left), the peripheral itself calls wait. Here the peripheral also needs to know about the clock period to translate the number of cycles into an actual delay. The delayed event notification could be avoided by means of a run-time or compile-time configurable conditional statement. However, none of this helps the usability and maintainability of the peripheral models.
• In the annotated case (right) the peripheral annotates the delay to the latency attribute in the PV response data structure. This way, the respective caller of the transport method can decide to use or ignore this information. A PV bus node, for example, would ignore the annotation whereas an OCP transactor would consider it (see below). Also, the clock-period information can be maintained in a more central place.

4.2 Timing Annotation Parameters
Timed PV target model (explicit timing, left)

PVResp& transport(const PVReq& req) {
  PVResp resp = req.obtainResp();
  // do processing
  // ...
  unsigned int latency = ...;
  wait(latency * clk_period);
  return resp;
}

Annotated PV target model (right)

PVResp& transport(const PVReq& req) {
  PVResp resp = req.obtainResp();
  // do processing
  // ...
  unsigned int latency = ...;
  resp.latency = latency;
  return resp;
}

Figure 6. Timing Annotation

In principle we could define an arbitrary number of timing parameters. However, an excessive number of parameters would be complex for the peripheral models to compute as well as complex for the interconnect models to interpret. In practice, we have had good experience with restricting the granularity of the timing annotation to the boundaries of transactions. This way, two timing parameters characterize the performance of any peripheral:
• The accept delay specifies the minimum time between two consecutive activations. In essence, the accept delay constrains the bandwidth of a block, that is, during this period a module is busy with the processing of an activation.
• The latency specifies the time between the process activation and the sending of the result.
In this way, the timing requirements of arbitrary platform building blocks can be roughly specified. For example, a pipelined ASIC block will exhibit an accept delay smaller than the response delay (latency), whereas for a task executed on a programmable core it will be the other way around. According to our experiments, even sophisticated timing behavior can be modeled using these two parameters. The timing of a memory controller, for example, is quite complex and depends on numerous state variables and timing parameters. Still, the timing of an individual memory access can be characterized quite accurately by computing the respective accept and response delay. Naturally, a timing annotation scheme based on only two parameters has certain limits in terms of accuracy. In some cases it might be unavoidable to model the timing on a cycle-by-cycle level to reach the required accuracy. However, such models become hard to maintain and are no longer reusable.

4.3 Transactors
As already indicated in Figure 1, transactors play an important role in facilitating the reuse of peripherals for multiple use cases. Apart from the plain translation between two TLM APIs, the transactor acts as a performance overlay model for the untimed PV peripherals.
[Figure 7 diagram; labels: OCP Channel, OCP Slave, transactor_target_AV_PV, PV Initiator, PV Target, transport(req), rsp, accept, latency.]

while (true) {
  wait(p_ocp->RequestStart());
  p_ocp->getRequest(ocp_req);
  pv_req=convert(ocp_req);
  pv_rsp=p_pv->transport(pv_req);
  ocp_rsp=convert(pv_rsp);
  accept = ...
  latency = ...
  p_ocp->acceptRequest(accept);
  p_ocp->sendResponse(ocp_rsp, latency);
}
Figure 7. Transactor as a Performance Overlay Model for PV Peripherals
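The role of the transactor as a performance overlay can be sketched in plain C++ without SystemC or OCP. All names below (PvResponse, AnnotatedTarget, TimingTransactor, handle) are hypothetical, and time is counted in abstract cycles instead of being passed to a channel. The sketch assumes that a latency annotation of zero means "not annotated", in which case the transactor falls back to a statically configured value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical PV-style response: payload plus an optional latency
// annotation in cycles (0 means the target did not annotate anything).
struct PvResponse {
    std::uint32_t data = 0;
    unsigned latency_cycles = 0;
};

// Hypothetical untimed target that annotates a data-dependent latency.
struct AnnotatedTarget {
    PvResponse transport(std::uint32_t request) const {
        PvResponse rsp;
        rsp.data = request + 1;  // stand-in for the real behavior
        rsp.latency_cycles = (request % 2 == 0) ? 2 : 5;
        return rsp;
    }
};

// Transactor: applies accept delay and latency on behalf of the target.
class TimingTransactor {
public:
    TimingTransactor(unsigned accept_cycles, unsigned default_latency)
        : accept_cycles_(accept_cycles), default_latency_(default_latency) {}

    // Returns the cycle at which the response is sent for a request that
    // arrives at cycle `arrival`. The functional call itself is untimed.
    template <typename Target>
    std::uint64_t handle(const Target& target, std::uint32_t request,
                         std::uint64_t arrival, PvResponse* out = nullptr) {
        // The accept delay limits how soon the next activation may start.
        const std::uint64_t start = std::max(arrival, next_accept_);
        const PvResponse rsp = target.transport(request);
        // Prefer the target's annotation, otherwise use the static value.
        const unsigned latency =
            rsp.latency_cycles != 0 ? rsp.latency_cycles : default_latency_;
        next_accept_ = start + accept_cycles_;
        if (out != nullptr) *out = rsp;
        return start + latency;
    }

private:
    unsigned accept_cycles_;
    unsigned default_latency_;
    std::uint64_t next_accept_ = 0;
};
```

With an accept delay of one cycle, two requests arriving in the same cycle are started one cycle apart, while each response time is governed by the latency. This corresponds to the pipelined case in which the accept delay is smaller than the latency.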
Figure 7 illustrates this concept by means of a simplified transactor for hooking PV target peripherals to OCP TL3 channels (please refer to [3] for more detailed explanations). The three dots represent the configurability in the calculation of the timing model. On the one hand, the delay could be based on static or stochastic values configured in the constructor of the model. In this case the transactor serves as a rough, first-order approximation for any untimed peripheral. On the other hand, the PV target could already be annotated with timing information. In this case the transactor should evaluate the timing annotation in the PV response data structure. Creating a transactor is usually not a trivial task. The TL3/PV target transactor is relatively simple, because the TL3 API also supports timing annotation. In other cases, the transactor needs to deal with timing in a more explicit way.

5. SUMMARY: AN ESL DESIGN FLOW
In this article we have presented a reuse-driven methodology for the transaction-level modeling of SoC peripherals. Under the assumption that Instruction-Set Simulators and bus models at different levels of abstraction are available from IP and ESL tool vendors, peripheral modeling is an essential ingredient to ease the adoption of ESL Design. The investment in peripheral modeling can be justified much more easily when multiple platform models for different purposes can be built from the same model base. In other words, the individual ESL design tasks like architecture exploration, software development, and verification should be combined into a seamless ESL design flow. An ideal ESL design flow is depicted in Figure 8.

[Figure 8 diagram; labels: Functional View, SW, HW, mapping, Architect's View, Programmer's View, Verification View, refinement, extraction.]
Figure 8. TLM Platform Models in the ESL Design Flow

The Functional View transaction-level model of the algorithm represents the optional starting point. If an executable specification of the application is available (e.g. from an algorithm development environment with SystemC export capabilities), it can be used by architects to define the right HW/SW partitioning and to explore the system architecture. Alternatively, the Architect's View model of the platform can be constructed from non-functional models of the application workload. The focus of the AV platform model is usually on the interconnect architecture and the memory subsystems. These peripherals should already deploy the modeling methodology outlined in this article to be reusable for further ESL design tasks.

A platform model for software development can be extracted from the AV platform model by integrating instruction-accurate processor simulators and replacing the AV-level bus with a simple PV bus. The peripheral models are reused by removing the PV/AV transactors. This effectively eliminates any timing information that is not important for the embedded software designer, in order to achieve the highest possible simulation speed. On the other hand, additional peripherals need to be added to the platform in order to complete the programmer's model of the platform.

Once the complete set of peripherals is available, a cycle-accurate model of the platform can be constructed. Here a cycle-accurate interconnect model needs to be integrated and the instruction-accurate ISSs need to be replaced by their cycle-accurate counterparts. The peripherals can again be reused by introducing the corresponding transactors and refining the timing annotation to an appropriate level. The resulting VV platform model can be used by verification engineers to perform HW/SW as well as TLM/RTL co-verification. Obviously, this flow assumes the availability of transaction-level models for the processor and interconnect architectures on different levels of abstraction.
Fueled by the progress of the standardization in IEEE, OSCI and OCP-IP, the number of models available from IP providers and ESL tool vendors is constantly increasing. The methodology described in this article leverages currently available TLM API standards as much as possible to benefit from this SystemC ecosystem.

ACKNOWLEDGEMENTS

As indicated by the constant usage of plural pronouns, the modeling methodology described in this article, including the corresponding modeling library, documentation, and examples, is the result of a significant development effort by the CoWare modeling team (in alphabetical order): Serge Goossens, Pavel Pokorny, Paul Stynen, Dave Upton, Jan Van Eetvelde, Karl Van Rompaey, and Bart Vanthournout. I would like to thank Tom De Schutter and Linda Schildermans for the review of the manuscript.

REFERENCES

[1] Frank Ghenassia (Ed.), "Transaction-Level Modeling with SystemC", Springer, 2005.
[2] Pieter van der Wolf et al., "Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach", CODES+ISSS, 2004.
[3] Tim Kogel, Anssi Haverinen, James Aldis, "OCP TLM for Architectural Modeling", OCP-IP whitepaper, 2005, available from www.ocpip.org
CHAPTER 7 QUANTITATIVE EMBEDDED SYSTEM ARCHITECTURE AND PERFORMANCE ANALYSIS
GRAHAM R. HELLESTRAND VaST Systems Technology Corporation, Sunnyvale, CA 94070, USA
1. SYNOPSIS – THE ARCHITECTING OF SYSTEMS
In the context of this chapter, a system is a software-dominated electronic engine for control, computation and communication, capable of interacting with the real world to capture data, transform it, and recommunicate it to the world. In its simplest form, the digital subsystem responsible for interpreting software, implementing algorithms and sampling and conditioning information for external communication can be represented as a Turing machine – essentially a collection of intercommunicating finite state machines (FSMs). In its more pragmatic form it is composed of (i) a skeleton constituted from the weavings that make up its communication and infrastructure fabric, (ii) devices attached to the fabric that are structured to support the intent and effective operation of the engine, and (iii) transducers that support the interactions required between the engine and the real world. Of course, the latter definition is more general than is needed to describe a digital system.

Now, the architecting of the system is the structuring of the skeleton, assemblages of devices, and input/output transducers to provide an engine capable of meeting its business and technical requirements. Relatively simple systems can be intuitively architected – rarely to produce an optimal system, but an acceptable one. Complex systems cannot be intuitively architected. The modern electronic engineering landscape is populated with engines that are over-engineered, unduly expensive, and plain incompetent. The architecting of optimal systems is an empirically driven, science-based discipline in which hypotheses (desirable insights) about the system are set up to be refuted and decisions are driven purely by data. This chapter is about the quantitative architecting of complex, software-dominated systems.

M. Burton and A. Morawiec (eds.), Platform Based Design at the Electronic System Level, 87–100. © 2006 Springer.
2. CRITICAL REACTIVE EMBEDDED SYSTEMS (CRES)
Within the phylum of systems, embedded systems form a large class, and critically reactive embedded systems form a small sub-class of embedded systems. Embedded systems are control systems embedded in products and devices that are implemented using software and hardware. Apart from network connectivity and local memory, embedded systems are usually stand-alone and have prescribed interfaces for user interaction and programming – like a cell phone, camera or a home printer. About 95% of all processors produced annually are embedded in these systems. Embedded systems may be simple – a simple bus fabric connecting a single processor, memory and peripheral devices, some of which are connected, via transducers, to subsystems in the real world from which they collect data and to which they provide control capability; others are complex, as will be seen below in examples of wireless and automotive control systems. Critically reactive embedded systems constitute that small subset of embedded control systems that have incontrovertible timing requirements that must be met. This is a severe constraint that has a far-reaching impact on architecture, design, algorithm selection, development and implementation. Architecture addresses the structure and function (behaviour and timing) of the software and hardware (digital, analog and I/O transducers) of the control systems, and the mechanical, RF (radio frequency) and other devices with which the control system interacts.

3. THE EVOLVING EMBEDDED DESIGN PROCESS
Modern embedded systems are largely software dominated in function, complexity, engineering effort, and support. This was not always the case: until about five years ago, the precursors of today's embedded systems were hardware-dominated systems in which architecture was driven as part of hardware design. The design process was sequential – hardware first, software following – and, as software began to dominate in complexity and (exponentially) in size, even while hardware increased in complexity (with a lesser index), the development times grew to unsupportable levels. The graph below shows the resource deployment (human & capital) in projects, in this case a 2.5G mobile phone, towards the end of life of the sequential design process – a process that had proven efficacious for hardware-dominated designs [1].

[Figure: Conventional Sequential H/W-S/W Process. Vertical axis: units of resources (-20 to 160). Horizontal axis: project time in 6-week periods (1 to 12). Polynomial-fit curves: Architecture, Hardware - ASIC Development, Software/Firmware, Systems Integration + V&V, Overall Project. Annotations: Risk 40; Resources 799 man-weeks; Project Period 12 periods.]

There are several indices of the portending doom of this process: (i) the peak resource deployment occurring in the last quarter of the predicted design process; (ii) the software engineering requirement dominating the critical project path, compounded by the strictures of the sequential design process; (iii) the decreasing time-to-market opportunity for globally competitive products; and (iv) market capacity to absorb technology-driven products that command premium pricing with short-term inelastic demand. Factors (i) through (iii) are captured as Risk, expressed in the following equation:

\[
\mathrm{Risk} = \sqrt{\frac{\mathrm{ResourceVariance}}{\left(\dfrac{\mathrm{RemainingProjectTime@PeakResourceDeployment}}{\mathrm{TotalProjectTime}}\right)^{2}}}
\]

The formulation of the risk equation is motivated by the Black-Scholes option pricing equation (the price of options is proportional to the variance in the underlying asset value), modified to reflect the critical nature of the timing of peak resource deployment in a project. The square root is a normalizing function. Although this formulation of risk is interesting, it is only one such formulation. A standard measure of project risk is yet to be determined.
4. CRES IN WIRELESS AND AUTOMOTIVE
Two significant and diverse architectural examples of CRES are described below. The first is a controller for a hypothetical cell phone, perhaps supporting multiple communications stacks simultaneously; the second is a distributed controller supporting feedback between some major subsystems in a car – this example is not hypothetical but it has been somewhat abstracted.

Closely-coupled multiprocessor wireless platform. This system is complex and contemporary and is implemented on a single piece of silicon. It supports two general purpose processors (an ARM1156 – perhaps for real-time control and games, and an ARM1176 – perhaps for applications processing) and two DSP processors (a StarCore SC2400 – 3–4G modem stack implementation, and an SC1400 – perhaps for a second simultaneously supported modem stack or video and/or audio decode). Supporting high bandwidth, low latency communication between master devices, which are processors, and slave devices, such as memories, is a primary requirement of such architectures. This is a major distinction between this architecture and that of the following example. The processors (master devices) need to be connected to slave devices (such as memory, timer, etc.) with support for various communication bandwidth and latency configurations.

[Figure: closely-coupled multiprocessor wireless platform; labels: ARM1176, ARM1156, StarCore SC1400 and StarCore SC2400 Virtual Processor Models, each with P ROM, D ROM, I Cache, D Cache and StdBus I/F master and slave ports; Combinational Communication & Infrastructure Fabric; StdBus Bridged; Memory Block; Shared Memory; P1 Memory; P2 Memory; Console 1; Console 2; UART; TIMER; INTC; P1 Devices; P2 Devices.]

The requirement to connect multiple master devices to multiple slave devices gives rise to a general interconnection schema, the best known being an m × n combinational interconnect, such as a crossbar. A number of general interconnect schemata supporting combinations of bandwidth and latency have been implemented, including ARM's AMBA AXI [2]. Since this is a general and critical interconnecting capability in silicon, we will briefly discuss its implementation.

Firstly, where a number of metal layers are available, there are a surprising number of simple geometrical configurations of even very wide m × n interconnects that avoid capacitive and inductive coupling and are space economical. That is, separate bundles of wires with no protocol each connect one of the m ports on the interconnect block to all n target ports. Registers, multiplexers and arbitration units can be attached to ports to effect various information transfer control strategies through the interconnect block.

Secondly, attaching master and slave devices through the interconnect block requires a protocol for high performance (high bandwidth, low latency), point-to-point communication. On silicon, 2/4-cycle signaling elements are simple and provide the minimal control required, via a pair of cause-effect signals such as information valid (i.valid) and information consumed (i.consumed), to implement reliable, high performance information transfer. In 4-cycle (asynchronous) signaling, the i.valid signal is asserted by a transfer initiator (producer) function (T.initiator), which services a master device wanting to transfer stable data to a slave device, and is transmitted to a transfer completer (consumer) function (T.completer), which services a slave device wanting to receive data from the master device.
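The ordering of events in 4-cycle signaling can be sketched in plain C++ as follows. This is an illustration only: the struct HandshakeChannel and the function transfer are hypothetical names, both ends of the channel are executed sequentially in one function, and no timing or concurrency is modeled. Steps 3 and 4 in the code are the return-to-idle phases that complete the handshake.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one initiator/completer pair.
struct HandshakeChannel {
    bool valid = false;     // i.valid, driven by the transfer initiator
    bool consumed = false;  // i.consumed, driven by the transfer completer
    int data = 0;           // payload, held stable while valid is asserted
};

// Performs one complete transfer, records the four signal edges in the
// order in which they occur, and returns the value seen by the completer.
int transfer(HandshakeChannel& ch, int payload,
             std::vector<std::string>& trace) {
    assert(!ch.valid && !ch.consumed);  // the channel must be idle

    ch.data = payload;                  // 1. initiator presents stable data
    ch.valid = true;
    trace.push_back("valid+");

    const int received = ch.data;       // 2. completer consumes the data
    ch.consumed = true;
    trace.push_back("consumed+");

    ch.valid = false;                   // 3. initiator withdraws i.valid
    trace.push_back("valid-");

    ch.consumed = false;                // 4. completer withdraws i.consumed
    trace.push_back("consumed-");

    return received;
}
```

Because every edge is caused by the preceding edge on the opposite signal, neither side needs to know the clock of the other, which is what makes this scheme attractive for bridging time domains.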
When the slave device has consumed the transferred data, the T.completer function asserts the i.consumed signal and transmits it to the T.initiator function, which then deasserts the i.valid signal, causing the T.completer function to deassert the i.consumed signal. Further analysis may be found in [3].

To control a high performance, reliable communication channel enabling a master device to transfer information (write) to a slave device connected through an m × n interconnect block, the following functional cause-effect chain is appropriate:

\[
\mathrm{T.initiator}_{Mdev} \xrightarrow{\;i.valid\;} \mathrm{T.completer} \rightarrow \mathrm{InterconnectBlock} \rightarrow \mathrm{T.initiator} \xrightarrow{\;i.valid\;} \mathrm{T.completer}_{Sdev}
\]
\[
\mathrm{T.initiator}_{Mdev} \xleftarrow{\;i.consumed\;} \mathrm{T.completer} \leftarrow \mathrm{InterconnectBlock} \leftarrow \mathrm{T.initiator} \xleftarrow{\;i.consumed_{Sdev}\;} \mathrm{T.completer}_{Sdev}
\]

In this example the T.initiatorMdev is part of the master device and the T.completerSdev is part of the slave device. It is assumed that for a write function, data accompanies the i.valid signals. The Interconnect Block may be arbitrarily complex with many Initiator-Completer pairs, with each pair using the same protocol. It is not necessary to merge the T.initiator and T.completer into the master and slave devices. However, when they are separate, some synchronization mechanism needs to bind them together in order to provide a reliable channel. The separation of devices from the communication mechanism supports a high degree of variability in configuring bandwidth and latency to meet the requirements of specific communication channels. In addition, there are likely to be multiple time domains operating across platforms; the inherently asynchronous nature of the initiator-completer pairs facilitates the bridging between such domains. These attributes can be used to advantage by tools that build devices and platforms, such as VaST's Peripheral Device Builder and Platform Constructor Tool, respectively.

Multiple controller, distributed automotive control platform. This system differs almost completely from the wireless example. Although in more recent developments multiple processor controllers are being designed into power-train applications, in general the electronic control units in a car use relatively low speed, single processors that communicate via low bandwidth, high latency communication channels to provide the required critical, real-time feedback control. However, the requirement that control and safety functions in a car must meet specified hard deadlines places real-time constraints on the communication protocols. To meet this requirement, the time-triggered protocols implemented on CAN-TT and FlexRay buses are increasingly being deployed.
The purpose of modeling such a system is to determine the required bandwidth and latency of the interconnect structure and the computation attributes and capabilities of the control units, under both extreme and average conditions. If the engine controller and the suspension controller each formed part of a distributed stability controller for a car, then safety, reliability and the ability to meet real-time schedules must be established under the harshest conditions.
[Figure: distributed automotive control platform; labels: MATLAB/Simulink Engine Model; MATLAB/Simulink Suspension Model & Display; Engine Controller, Suspension Controller and Panel Controller, each with Processor (VPM), CPU Bus and CAN Controller; CAN-TT Bus.]
It should be noted here that experiments performed on the model are able to cover many more cases than testing using a physical artifact. This is a surprising result to most engineers. But the controllability of a model and the ability to configure extreme situations, as well as the ability to run data acquired from a physical artifact through the model, enable more comprehensive testing to be undertaken. Often this testing on the Virtual System Prototype (VSP) runs at, or better than, real time: since most subsystems in a car have relatively slow response rates to events, the VSP running on a modern PC will outperform the slow silicon used to implement the physical controller. The exception is the power train, which will require multi-host simulation capability – an area in which VaST has an active research program.

The contrast between the two examples gives an interesting insight into the modeling required of processors, interconnects and devices, and into the wide variety of characterization of objective functions and experimentation required to drive optimization. One constant, however, remains. Models need to be timing accurate and high performance to be used for architectural experimentation and for developing software for real-time control systems. Software developed on inaccurate and/or incomplete models cannot be guaranteed to run on silicon; such models cannot guarantee anything about real-time capabilities: meeting of hard schedules, power consumption, performance, bandwidth, throughput, inter alia.
5. THE CRES DESIGN PROCESS
The design process underlying a modern CRES system development is empirically driven. It differs from the conventional process by initially developing an executable architecture of the whole system – hardware, software and I/O transducers – and then deriving from it a high performance, timing accurate golden reference model of the system, known as a Virtual System Prototype (VSP). The VSP is used to concurrently design the synthesizable hardware (including the I/O transducers) and complete the development of the software. A side-effect of this process is that the VSP becomes the golden test bench, including for register transfer hardware designs. VaST’s version of the System Development Process is shown below:
[Figure: VaST System Development Process. Labels: Business Requirements; Functional Requirements; Initial System Architecture; Virtual Prototype (Executable System Specification); COMet System Level Design Tool; METeor; Architect and Test, Design and Test, and Develop and Test stages for Software and for Hardware; Silicon Hardware Platform + Embedded System Software.]
This process has a profound effect on the efficacy of the engineering of CRES systems. The process curves below reflect the modern style of system development for the same 2.5G mobile phone discussed for the conventional embedded systems design process. Comparing the Conventional and Concurrent design styles reveals some dramatic changes. Firstly, architecture – involving hardware, software and transducers – is represented separately as a significant up-front activity that commands a lot of resources. The initial development of the architecture requires porting target operating systems and legacy software and the development of, at least, stub drivers for projected hardware; that is, software development begins well before hardware design. The peak of the resource deployment curve now occurs in the first third of the project, reflecting the determinism derived from having a system defined and credible project plans in place early on. The net effect of the process change, for this project, was a decrease in resource requirements of about 35%, a decrease in time to market of around 40% and a decrease in project risk of a massive 75%. Figures in these ranges are consistently reported by VaST’s customers for this caliber of product development.
[Figure: VaST Concurrent H/W-S/W Process. Units of Resources against Project Time (2-month periods), with polynomial trend curves for Architecture, Software/Firmware, Hardware (ASIC development), Systems Integration + V&V, and the Overall Project. Chart annotations: Risk 10 units; Resources 520 man-weeks; Project Period 9 periods.]
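As a rough consistency check on these figures, the sketch below inverts the quoted percentage reductions to recover the conventional-process totals they imply. It assumes, which the text does not state explicitly, that the chart annotations (520 man-weeks, 9 two-month periods, 10 risk units) are totals for the concurrent process and that each percentage is measured against the conventional process. The variable and function names are illustrative only.

```python
# Assumption: chart annotations are concurrent-process totals.
concurrent = {"resources_man_weeks": 520, "time_periods": 9, "risk_units": 10}
# Approximate reductions quoted in the text (35%, 40%, 75%).
reduction = {"resources_man_weeks": 0.35, "time_periods": 0.40, "risk_units": 0.75}


def implied_baseline(after: float, fractional_reduction: float) -> float:
    """Invert after = before * (1 - reduction) to recover 'before'."""
    return after / (1.0 - fractional_reduction)


baseline = {key: implied_baseline(concurrent[key], reduction[key])
            for key in concurrent}

for key, value in baseline.items():
    print(f"{key}: concurrent={concurrent[key]}, implied conventional={value:.0f}")
```

Under those assumptions the implied conventional project is roughly 800 man-weeks over 15 two-month periods with 40 units of risk. Because the quoted percentages are approximate, these values should be read as indicative only.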
A side note is that, for this process to perform, the VSP needs to be timing accurate: clock cycle accurate for processors; clock edge accurate for transactions through buses and bus bridges; and timing accurate for peripheral devices and I/O transducers. The objective is to get the model running with less than 1% timing variation from the silicon, recognizing that loads – particularly bus, memory and device loads – will affect the silicon timing when compared with the synthesized register transfer hardware design. Concomitantly, VSPs also need to be high performance: 50–200+ MIPS for processors, 1.5 million transactions/sec for buses and bridges, and, for peripheral devices, whatever rate their function requires. If the VSP is slow, the experimentation at the architectural end will be incomplete, likely seriously so, resulting in irreparably deficient specifications in competitive markets. Specifications that are incompetent in either timing or function, and that become the golden reference, will misdirect the software and hardware engineering processes.

The effect is likely worse than that of the intuitive engineering process that drives the old hardware-dominated design style, which has been well documented [4] as producing late and functionally incomplete designs (more than 50% of the time), project cancellations (20% of the time), product respins (more than 40% of the time), and grossly over-engineered products. As products become more complex, these numbers climb. The Architecture Driven Design Process is the first huge step in addressing this problem. Quantitative driving of this process further refines it to produce optimal control systems from optimal Virtual System Prototypes. This is the subject of the next Section.

6. QUANTITATIVE SYSTEMS ARCHITECTURE
The use of empiricism in developing optimal software is rare. The complexity of the processor-centric electronic systems that control modern products requires a systematic approach to developing systems (software and hardware) in order to deliver an optimal fit for an intended product. When engineering is critical to survival in global, competitive markets, building optimal systems is a requirement: over-engineering costs time to market and money; under-engineering fails to achieve the intended purpose of the product; and, worst, under-engineering by intuitively over-engineering selected subsystems – a remarkably frequent case – costs time and money and still fails to produce a competent product [5].

The intuitive optimization of systems – architecture, software design, hardware design, and interfaces – has largely been driven from hardware design. Since hardware designers have rarely been required to understand the software that would run on their architectures, they have tended to speculatively over-engineer subsystems perceived to be performance critical. In relatively simple systems, this history-driven design style produced working architectures of unknown optimality. For complex designs, the approach has largely been spectacularly unsuccessful.

The ability to support data-driven decision making early in the system development process is one of the underlying drivers of building models of systems that are timing accurate and high performance. It is known that software and algorithms have a 1st order effect on an embedded system’s performance; similarly, hardware (platform) design has a 1st order effect on the system. This is difficult to reconcile with practice, in which next-generation product planning often focuses on processor micro-architecture, even though the 5%–8% improvement gained from iterative micro-architecture development affects the system negligibly – a 2nd or 3rd order effect.

Deriving best-case, average and worst-case system performance from simulation using VSPs enables analyses of system variability, leading to the identification of the factors having the most significant impact on variability. These factors are prognostic as well as diagnostic, and they can be used to drive the optimization of systems. In order to use such factors to drive the optimization process, a great number of experiments – involving many simulation runs – on various configurations of a system may be required.
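To give a sense of how the number of experiments grows, and how a fractional factorial design (a standard design-of-experiments technique) can reduce it, the following minimal sketch compares a full two-level design with a half fraction and ranks factors by main effect. The factor names and the `simulate` response function are hypothetical stand-ins invented for illustration; in practice each response would come from a VSP simulation run.

```python
from itertools import product

# Hypothetical two-level factors; names are illustrative only.
FACTORS = ["cache_size", "line_size", "memory_type", "bus_width"]


def full_factorial(k):
    """All 2**k runs, each factor coded as -1 (low) or +1 (high)."""
    return [tuple(run) for run in product((-1, 1), repeat=k)]


def half_fraction(k):
    """2**(k-1) runs: the last factor is the product of the others
    (defining relation I = ABCD when k = 4)."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + (last,))
    return runs


def main_effects(runs, responses):
    """Main effect per factor: mean response at +1 minus mean at -1."""
    effects = []
    for j in range(len(runs[0])):
        high = [y for run, y in zip(runs, responses) if run[j] == 1]
        low = [y for run, y in zip(runs, responses) if run[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


def simulate(run):
    """Stand-in for one simulation run; a made-up linear response."""
    a, b, c, d = run
    return 100.0 + 8.0 * a + 1.0 * b + 4.0 * c + 0.5 * d


full = full_factorial(len(FACTORS))
half = half_fraction(len(FACTORS))
effects = main_effects(half, [simulate(run) for run in half])
ranked = sorted(zip(FACTORS, effects), key=lambda fe: abs(fe[1]), reverse=True)

print(len(full), "runs in the full design;", len(half), "in the half fraction")
for name, effect in ranked:
    print(f"{name}: main effect {effect:+.1f}")
```

The half fraction recovers the same ranking of main effects with half the runs. The price is that, in this particular design, each main effect is aliased with a three-factor interaction, so the saving is only safe when such interactions can be assumed small.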
The design of experiments methodology [6] helps by providing a statistically valid mechanism for dramatically reducing the number of experiments that need to be performed. Since VSPs are used to directly execute software, including hard real-time code, during development and debugging, trace information (streamed from non-perturbing probes inserted into the model) – including response latencies, power consumption, speed between markers, frequency of function calls, etc. – is produced alongside the usual debug data and hence is available to software and systems engineers as a normal part of the edit-compile-execute-debug software development cycle. This changes the perspective of where optimization should occur: as a normal part of the development cycle, not as a post-development clean-up.

6.1 Optimization of Systems
6.1.1 Event-based, objective functions
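In an event driven simulation environment, an objective function can be expressed as a function whose parameters are themselves functions, each characterizing the contribution to the objective function of one of the components constituting the system, viz. CPUs, buses, bus bridges, memories and peripheral devices. The parameter functions in turn have parameters that are functions of simulation event types, sourced from the various event activities that occur in a VSP during simulation. In general, an objective function is a function of functions of functions of events and has the form of Equation (1), below. This is actually a generic function that computes power consumption.

\[
\begin{aligned}
F_{\mathrm{Power}}\Bigl(\;
& f_{\mathrm{CPU}}\Big|_{cc=0}^{cn}\Bigl( f_{\mathrm{CPU}_{cc}}\Big|_{\mathit{CEvType}=1}^{cet}\bigl( g_{\mathrm{CPU}_{cc},\,\mathit{CEvType}}\Big|_{\mathit{CEvCnt}=scecnt}^{cecn}\bigl( \mathrm{Event}_{\mathrm{CPU}_{cc},\,\mathit{CEvType},\,\mathit{CEvCnt}} \bigr)\bigr)\Bigr),\\
& f_{\mathrm{Bus}}\Big|_{bc=0}^{bcn}\Bigl( f_{\mathrm{Bus}_{bc}}\Big|_{\mathit{BEvType}=1}^{bet}\bigl( g_{\mathrm{Bus}_{bc},\,\mathit{BEvType}}\Big|_{\mathit{BEvCnt}=sbecnt}^{becn}\bigl( \mathrm{Event}_{\mathrm{Bus}_{bc},\,\mathit{BEvType},\,\mathit{BEvCnt}} \bigr)\bigr)\Bigr),\\
& f_{\mathrm{BusBridge}}\Big|_{bbc=0}^{bbcn}\Bigl( f_{\mathrm{BBus}_{bbc}}\Big|_{\mathit{BBEvType}=1}^{bbet}\bigl( g_{\mathrm{BBus}_{bbc},\,\mathit{BBEvType}}\Big|_{\mathit{BBEvCnt}=sbbecnt}^{bbecn}\bigl( \mathrm{Event}_{\mathrm{BBus}_{bbc},\,\mathit{BBEvType},\,\mathit{BBEvCnt}} \bigr)\bigr)\Bigr),\\
& f_{\mathrm{Mem}}\Big|_{mc=0}^{mcn}\Bigl( f_{\mathrm{Mem}_{mc}}\Big|_{\mathit{MEvType}=1}^{met}\bigl( g_{\mathrm{Mem}_{mc},\,\mathit{MEvType}}\Big|_{\mathit{MEvCnt}=smecnt}^{mecn}\bigl( \mathrm{Event}_{\mathrm{Mem}_{mc},\,\mathit{MEvType},\,\mathit{MEvCnt}} \bigr)\bigr)\Bigr),\\
& f_{\mathrm{Dev}}\Big|_{dc=0}^{dcn}\Bigl( f_{\mathrm{Dev}_{dc}}\Big|_{\mathit{DEvType}=1}^{det}\bigl( g_{\mathrm{Dev}_{dc},\,\mathit{DEvType}}\Big|_{\mathit{DEvCnt}=sdecnt}^{decn}\bigl( \mathrm{Event}_{\mathrm{Dev}_{dc},\,\mathit{DEvType},\,\mathit{DEvCnt}} \bigr)\bigr)\Bigr)\Bigr)
\end{aligned}
\tag{1}
\]

where an index range attached to a function denotes its list of arguments, for example

\[
f_{\mathrm{CPU}_k}\Big|_{\mathit{EvType}=1}^{et}\bigl( g_{\mathrm{CPU}_k,\,\mathit{EvType}} \bigr) = f_{\mathrm{CPU}_k}\bigl( g_{\mathrm{CPU}_k,1},\; g_{\mathrm{CPU}_k,2},\; \ldots,\; g_{\mathrm{CPU}_k,et} \bigr)
\]

6.1.2 Binding the generic power function to a specific architecture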
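Before the generic function is bound to a concrete architecture, the nesting in Equation (1) can be made tangible with a minimal sketch. It assumes, purely for illustration, that every level accumulates by summation and that each event type carries a constant weight keyed only by event type (Equation (1) allows a separate function per component instance and event type). The trace layout, event names and numbers are hypothetical and are not taken from the study.

```python
from typing import Callable, Dict, List

# trace[component_class][instance][event_type] -> list of event records
Trace = Dict[str, Dict[str, Dict[str, List[int]]]]
Accumulate = Callable[[List[float]], float]
EventFn = Callable[[List[int]], float]


def evaluate(trace: Trace,
             g: Dict[str, EventFn],
             f_instance: Accumulate = sum,
             f_class: Accumulate = sum,
             F: Accumulate = sum) -> float:
    """Evaluate F(f_class(f_instance(g(events)))) over a recorded trace."""
    class_values = []
    for instances in trace.values():
        instance_values = []
        for events_by_type in instances.values():
            type_values = [g[ev_type](events)
                           for ev_type, events in events_by_type.items()]
            instance_values.append(f_instance(type_values))
        class_values.append(f_class(instance_values))
    return F(class_values)


def weighted_count(weight: float) -> EventFn:
    """g for one event type: a constant weight times the number of events."""
    return lambda events: weight * len(events)


# Hypothetical trace: each list holds the cycle numbers at which events fired.
trace = {
    "CPU": {"cpu0": {"instr": [1, 2, 3, 4], "cache_miss": [3]},
            "cpu1": {"instr": [2, 5]}},
    "Bus": {"bus0": {"transaction": [3, 4]}},
}
g = {"instr": weighted_count(1.0),
     "cache_miss": weighted_count(30.0),
     "transaction": weighted_count(50.0)}

total = evaluate(trace, g)
print(total)
```

Passing different callables for `f_instance`, `f_class` or `F` (for example `max` instead of `sum`) changes how contributions are combined at each level without touching the event-level functions, which is the separation that the nested form is meant to convey.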
A variant of the wireless architecture discussed above was instrumented. One experiment investigated the power consumed by the ARM1156; the 3 other processors were held in reset for the duration of this simple experiment [7]. The basic function computed is Instant Power, which calculates the total energy consumed over some period of time or some number of events (such as cycles). The functions computed that are useful for optimization purposes are:
• Maximum power consumed over a particular period (maximum of the instant powers).
• Average power consumed over the whole experiment.
A specific function used to compute instant power per k-cycles is given in Equation 2:

\[
f_{\mathrm{Power}} = W_{\mathrm{Pipe}} \times f_{\mathrm{Pipe}} + W_{\mathrm{Instr}} \times f_{\mathrm{Instr}} + W_{\mathrm{Cache}} \times f_{\mathrm{Cache}} + W_{\mathrm{TLB}} \times f_{\mathrm{TLB}} + W_{\mathrm{RegAcc}} \times f_{\mathrm{RegAcc}} + W_{\mathrm{MemAcc}} \times f_{\mathrm{MemAcc}} + W_{\mathrm{PeriphAcc}} \times f_{\mathrm{PeriphAcc}}
\tag{2}
\]

where

\[
f_{\mathrm{Instr}} = 2 \times f_{\mathrm{Instr}_{jmp}} + 2 \times f_{\mathrm{Instr}_{except}} + 0 \times f_{\mathrm{Instr}_{ctrl}} + 12 \times f_{\mathrm{Instr}_{coproc15}} + 0 \times f_{\mathrm{Instr}_{LdSt}} + f_{\mathrm{Instr}_{arith}} + f_{\mathrm{Instr}_{other}}
\]

and

\[
f_{\mathrm{Instr}_{i}} = \text{instructions of type } i \text{ in } k\text{-cycles}
\]

Similar functions occur for $f_{\mathrm{Pipe}}$, $f_{\mathrm{Cache}}$, $f_{\mathrm{TLB}}$, $f_{\mathrm{RegAcc}}$, $f_{\mathrm{MemAcc}}$ and $f_{\mathrm{PeriphAcc}}$; the weights for the constituent accumulating functions are given in Table 1. The weights $W_i$ for each of the classes of functions contributing to $f_{\mathrm{Power}}$ have been set to the constant function 1 in this study. In industry studies, the accumulating functions might be replaced with individual functions relevant to computing power in ways not considered for the simple examples of this paper. Such functions can include history and implementation dependent technology functions. Similarly, the weights $W_i$ may be more complex functions – for example, the cache hit weights are functions of cache structure (size, wayness, policies).

Table 1. Power: Function Types, Events and Weighting Functions

Function Type                            Event                  Weight Function
Pipeline                                 ibase                  6.0
Instruction Types                        ijmp                   2.0
                                         iexcept                2.0
                                         ictrl                  0
                                         icoproc                12.0
                                         iundefs                0
                                         imemrd                 0
                                         imemwt                 0
                                         imemrw                 0
                                         iarith                 1.0
                                         iother                 1.0
Caches (I&D)                             Cache_lookup           f_i-dcache(size, ways)
                                         icache_hit             Icache_lookup + f_icache(line size, decode)
                                         icache_miss            Icache_lookup
                                         dcache_hit             Dcache_lookup + f_dcache(size, ways, line size)
                                         dcache_miss            Dcache_lookup
                                         line_fill              0
TLB                                      tlb_miss               30.0
Register                                 regfile_access         1.0
Memory (incl. bus transactions)          membus_transaction     50.0
Periph Device (incl. bus transactions)   periphbus_reg_access   50.0

6.1.3 The use and interpretation of event-based optimization functions
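As a concrete reference point for the discussion in this subsection, the instant-power function of Equation 2 can be sketched as a weighted accumulation of event counts. The sketch uses the numeric weights listed in Table 1 and class weights of 1, as in the study. It is an illustration under stated assumptions rather than the study's implementation: the cache term is omitted because its weights are functions of cache structure that are not given numerically, `ibase` is assumed to be the pipeline event, the per-window event counts are invented, and the average is taken over equal-length windows.

```python
# Event weights from Table 1, grouped by function class.
# Assumption: cache events are left out because their weights depend on
# cache structure (size, ways, line size) and are not given as numbers.
EVENT_WEIGHTS = {
    "Pipe": {"ibase": 6.0},
    "Instr": {"ijmp": 2.0, "iexcept": 2.0, "ictrl": 0.0, "icoproc": 12.0,
              "iundefs": 0.0, "imemrd": 0.0, "imemwt": 0.0, "imemrw": 0.0,
              "iarith": 1.0, "iother": 1.0},
    "TLB": {"tlb_miss": 30.0},
    "RegAcc": {"regfile_access": 1.0},
    "MemAcc": {"membus_transaction": 50.0},
    "PeriphAcc": {"periphbus_reg_access": 50.0},
}
# Class weights W_i are the constant 1 in the study.
CLASS_WEIGHTS = {name: 1.0 for name in EVENT_WEIGHTS}


def instant_power(counts):
    """Weighted sum of event counts observed in one k-cycle window."""
    total = 0.0
    for cls, weights in EVENT_WEIGHTS.items():
        f_cls = sum(w * counts.get(event, 0) for event, w in weights.items())
        total += CLASS_WEIGHTS[cls] * f_cls
    return total


# Hypothetical event counts for two consecutive k-cycle windows.
windows = [
    {"ibase": 10, "ijmp": 3, "iarith": 20, "iother": 5,
     "tlb_miss": 1, "regfile_access": 40, "membus_transaction": 2},
    {"ibase": 10, "ijmp": 1, "icoproc": 2, "iarith": 10,
     "regfile_access": 20, "periphbus_reg_access": 1},
]

powers = [instant_power(w) for w in windows]
max_power = max(powers)                    # maximum of the instant powers
average_power = sum(powers) / len(powers)  # average over the experiment
print(powers, max_power, average_power)
```

Keeping the weights in a table separate from the accumulation loop means that a constant weight could later be replaced by a function of cache structure or of event history without changing how the windows are processed.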
Event Bindings are simple to implement, typically a pointer to a function and a history buffer of events. However, the extraction of data from register transfer (RT) models or representative samples of the silicon is difficult, time consuming and subject to experimental errors. The inability to map an event in a behavioural model to an observable data point in the silicon or RT model further complicates the building of accurate tables. Events, in event-based simulation, are associated with the aggregate underlying behaviour of the circuits that the semantics of events are intended to describe. The sources of Event Types generated by all types of components in a complex platform may number in the hundreds; the sources of Event Instances (events caused by the simulation of instances of typed components instantiated in a platform) number in the thousands, possibly many thousands. Since component instances, even though morphologically similar or identical, may have different electronic instantiations and be affected by local circuit connectivity, the objective function table may remain large.

To set appropriate Event Bindings for entries in the Objective Function Table, the knowledge and skills of the silicon vendors are required, since the precise physical association of an event with the circuit implementation that it models is not necessarily obvious and, in many cases, is unlikely to be independent of circuit effects associated with other events. The ascription of local physical semantics to an event is more in the nature of a verisimilitude. In general, the higher the level of event-based behavioural abstraction, the weaker the physical (or structural) connection to implementation. This is a good thing for high performance modeling and, with forethought, does not compromise timing accuracy at an agreed level of granularity, say clock-cycle level. However, it does mean that more sophisticated mechanisms are required to efficiently predict aggregate underlying hardware concepts, such as power.

In addition, various event types and instances may correlate highly across a variety of behaviours – for instance, cache misses with processor initiated bus traffic, data path stalling and power consumption. Eliminating dependence between event instance sources, which are the potential independent variables in a statistical analysis, should condense the number of variables required to explain a behaviour, or to be used in the prediction of behaviour, and should reduce the propensity for over-estimation using data extracted from these models.

7. AN EXAMPLE OF QUANTITATIVE INVESTIGATION
The Speed – instructions/10-cycles (IPC10) – and the relative Power Consumption of 9 structural variants of an experimental VSP, similar to that used to describe the power calculation above, were computed while booting Linux. The variants were selected from the full set of variants determined by: cache size: 1k, 8k, 32k; cache line: 16B, 32B; memory configured as DDR (1st word delayed 5 cycles, 2nd word available per 1/2 cycle) or SDR (1st word delayed 5 cycles, 2nd word available per 1 cycle); bus data width: 4 bytes. The results are shown in Graphs 2A and 2B.
[Graph 2A: VPM Speed - Linux Boot on ARM926E Subsystem of Fig. 1 VSP. Instructions / 10-Cycles against Cache Size (Bytes), with series CL = 16B, Mem = DDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]
[Graph 2B: Power Consumption - Linux Boot on ARM926E Subsystem of Fig. 1 VSP. Ave. Power * 10^7 / # Instructions against Cache Size (Bytes), with the same three series.]
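The relationship between the full set of variants and the 9 that were run can be sketched by enumerating the factor levels listed above. The selection rule is an assumption: it supposes that the three cache-line and memory combinations named in the graph legends, each run at all three cache sizes, make up the 9 variants. The bus data width is fixed at 4 bytes and so does not multiply the count.

```python
from itertools import product

cache_sizes = ["1k", "8k", "32k"]
cache_lines = ["16B", "32B"]
memories = ["DDR", "SDR"]

# Full set: every combination of the three varied factors.
full_set = list(product(cache_sizes, cache_lines, memories))

# Assumption: the series plotted in Graphs 2A and 2B are the selected ones.
plotted_series = {("16B", "DDR"), ("32B", "DDR"), ("32B", "SDR")}
selected = [v for v in full_set if (v[1], v[2]) in plotted_series]
omitted = [v for v in full_set if v not in selected]

print(len(full_set), "variants in the full set;", len(selected), "selected")
print("omitted:", omitted)
```

On this reading, the combination left out is the 16B cache line with SDR memory at each of the three cache sizes.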
The boot sequence of Linux spends more than 50% of its time executing with the I&D caches disabled. Linux performs initialization of the cache after the Initial Program Load, kernel load and the device driver installations. Once the operating system has booted and the idle loop is executing, the behaviour of the VSP is much the same as its behaviour running Viterbi; that is, the working-set size is compatible with any cache size. As is also expected, in an environment where the working-set size of the target code greatly exceeds the cache size, the impact of the memory hierarchy on power and speed is considerable. For booting Linux, the settings of the ARM926E VSP subsystem that yield minimum power consumption and maximum VSP speed are: cache size 32 kbytes, cache line size 32 bytes, and DDR memory. To minimize cost as well, a cache size (I&D) of 16 kbytes would proportionally reduce silicon cost by about 30% and adversely affect both power and speed by about 1%. To further optimize for cost, cache sizes of 8 kbytes will yield a further ∼25% reduction in silicon with a worsening in power consumption and speed of 5%–10%. The overall results are expected, but the quantification of the results cannot be derived intuitively; it can only be obtained from a timing accurate model that has sufficient performance to enable large, representative software loads to be run. These results should be used to drive the optimization of the VSP – hardware, software and IO interfaces.
8. SUMMARY AND CONCLUSION
The use of high performance, timing accurate models of multi-processor and distributed systems is part of the standard methodology for building complex critical reactive embedded electronic control systems. The requirements to develop software prior to the design of hardware and to comprehensively test the device preclude the architecting and development of such systems physically. The use of virtual system prototypes (VSP) as the golden reference for system development – software and hardware – is the logical outcome of this discussion. In the quest for developing systems that work and solve the problems they are designed to solve, empirical experimentation is the only methodology humans have evolved to ensure either desired outcomes or reasons for the failure to produce a desirable outcome. In this methodology, the refutation of carefully constructed hypotheses is the centre piece that needs to drive the quantitative engineering process; this is just the scientific process. Intuitive design, however intelligent, is no substitute for engineering decision making driven by hard data. If hypothesis building concerns speed, power consumption, reaction time, latency, meeting real-time schedules, etc., the model needs to be timing accurate (processor, buses, bus bridges and devices). If the extensive execution of software is an intrinsic part of the empirical experimentation, then the model needs to have high performance across all components.
Optimizing systems with complex objective functions is not intuitive. Complex tradeoffs between hardware structure and the software and algorithms that are executed on the hardware cannot be made by ratiocination or formal analysis alone; the acquisition of data as part of well-formed experiments refuting thoughtfully constructed hypotheses (ratiocination) enables decision making driven by results. Optimization comes from considering hardware, software and IO interfaces together, not separately. The process driven by the VSP as the golden architecture itself drives radical change in the engineering systems of companies. This change offers huge savings in resources (human and capital) and in time-to-market, and improves the fitness, safety and reliability for use of the resulting products. The reduction of project risk is of overwhelming importance. In the situation where advanced engineering companies with high infrastructure costs and inflexible decision making are unable to compete with newer entrants having lower costs and greater flexibility, the VSP driven architecture process offers a strategic mode where intelligent, data driven activity dictates architectures that can then be contracted out for piecemeal realization, without divulging the consolidated intellectual property constituting the system or the decision making process that resulted in the optimal family of designs. This is not just a desirable strategy but a mandatory one.

REFERENCES
[1] Hellestrand, G.R. The Engineering of Supersystems. IEEE Computer, 38, 1 (Jan 2005), 103–105.
[2] ARM AMBA AXI Protocol Specification. 2004. www.arm.com
[3] Hellestrand, G.R. Event, Causality, Uncertainty and Control. Proc. 2nd Asia Pacific Conf. on Hardware Description Languages (APCHDL’94), Toyohashi, Japan, 24–25 Oct. 1994, 221–227.
[4] Venture Data Corporation. Collett International Report on Embedded Systems. 2004.
[5] Hellestrand, G.R. Systems Architecture: The Empirical Way – Abstract Architectures to ‘Optimal’ Systems. ACM Conf. Proc. EmSoft2005, Sept 2005, Jersey City, NJ.
[6] Montgomery, D.C. Design and Analysis of Experiments. 5th Ed. John Wiley & Sons, NY, 2001.
[7] Hellestrand, G.R., Seddighnezhad, M. and Brogan, J.E. Profiles in Power: Optimizing Real-Time Systems for Power as well as Speed, Response Latency & Cost. Power Aware Real-time Conference (PARC2005), Sept 2005, Jersey City, NJ.