Network Processor Design Issues and Practices Volume 3
Network Processor Design: Issues and Practices
Editors: Patrick Crowley, Washington University; Mark A. Franklin, Washington University; Haldun Hadimioglu, Polytechnic University; Peter Z. Onufryk, Integrated Device Technology, Inc.

Responding to ever-escalating requirements for performance, flexibility, and economy, the networking industry has opted to build products around network processors. To help meet the formidable challenges of this emerging field, the editors of these volumes created the Workshop on Network Processors, a forum for scientists and engineers to discuss the latest research in architecture, design, programming, and use of these devices.

Network Processor Design: Issues and Practices, Volume 1, ISBN: 1558608753
Edited by Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk
Volume 1 contains not only the results of the first workshop but also specially commissioned material that highlights industry’s latest network processors.

Network Processor Design: Issues and Practices, Volume 2, ISBN: 0121981576
Edited by Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk
Volume 2 contains 20 chapters written by the field’s leading academic and industrial researchers, with topics ranging from architectures to programming models, from security to quality of service.

Network Processor Design: Issues and Practices, Volume 3, ISBN: 0120884763
Edited by Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk
Volume 3 is devoted to the latest academic and industrial research, investigating recent advances in networking, telecommunications, and storage.
Network Processor Design: Issues and Practices, Volume 3
Edited by
Patrick Crowley Mark A. Franklin Haldun Hadimioglu Peter Z. Onufryk
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier

Publisher: Denise E. M. Penrose
Publishing Services Manager: Simon Crump
Editorial Assistants: Summer Block and Valerie Witte
Cover Design: Ross Carron Design
Cover Image: Getty Images
Text Design: Windfall Software
Composition: Newgen Imaging Systems (P) Ltd.
Technical Illustration: Newgen Imaging Systems (P) Ltd.
Copyeditor: Eileen Kramer
Proofreader: Jacqui Brownstein
Indexer: Kevin Broccoli
Printer: The Maple-Vail Book Manufacturing Group

The programs, procedures, and applications presented in this book have been included for their instructional value. The publisher and authors offer NO WARRANTY OF FITNESS OR MERCHANTABILITY FOR ANY PARTICULAR PURPOSE and do not accept any liability with respect to these programs, procedures, and applications.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters, or in a specific combination of upper- and lowercase letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers
An imprint of Elsevier
500 Sansome Street, Suite 400
San Francisco, CA 94111
www.mkp.com

©2005 by Elsevier, Inc. All rights reserved.
Printed in the United States of America
09 08 07 06 05    5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning or otherwise—without prior written permission of the Publisher. Library of Congress Control Number: 2003213186 ISSN: 1545-9888 ISBN: 0–12–088476–3 This book is printed on acid-free paper.
About the Editors Patrick Crowley received his B.A. from Illinois Wesleyan University, where he studied Mathematics, Physics, and Computer Science; and his M.S. and Ph.D. degrees, both in Computer Science and Engineering, from the University of Washington. Crowley’s research interests are in the area of computer systems architecture, with a present focus on the design and analysis of programmable packet-processing systems. He is an active participant in the architecture research community and a reviewer for several conferences and journals. He was an organizer and member of the program committee of the HPCA Workshop on Network Processors in 2002, 2003 and 2004. In Autumn 2003, Dr. Crowley joined the faculty of the Department of Computer Science and Engineering at Washington University in St. Louis as an Assistant Professor. Mark A. Franklin received his B.A., B.S.E.E., and M.S.E.E. from Columbia University, and his Ph.D. in Electrical Engineering from Carnegie-Mellon University. He is currently at Washington University in St. Louis where he is in the Department of Computer Science and Engineering, and is the Hugo F. and Ina Champ Urbauer Professor of Engineering. He founded the Computer and Communications Research Center and, until recently, was the Director of the Undergraduate Program in Computer Engineering. Dr. Franklin is engaged in research, teaching, and consulting in the areas of computer and communications architectures, ASIC and embedded processor design, parallel and distributed systems, and systems performance evaluation. He is a Fellow of the IEEE, a member of the ACM, and has been an organizer and reviewer for numerous professional conferences including the Workshops on Network Processors 2002, 2003 and 2004. He has been Chair of the IEEE TCCA (Technical Committee on Computer Architecture), and Vice-Chairman of the ACM SIGARCH (Special Interest Group on Computer Architecture). Haldun Hadimioglu received his B.S. and M.S. 
degrees in Electrical Engineering at Middle East Technical University, Ankara, Turkey and his Ph.D. in computer science from Polytechnic University in New York. He is currently an Industry Associate Professor in the Computer and Information Science Department at the Polytechnic University. From 1980 to 1982, he worked as a research engineer at PETAS, Ankara, Turkey. Dr. Hadimioglu’s research and teaching interests include computer architecture, parallel and distributed systems, networking, and ASIC design. He was a guest editor of the special issue
on “Advances in High Performance Memory Systems,” IEEE Transactions on Computers (November 2001). Dr. Hadimioglu is a member of the IEEE, the ACM, and Sigma Xi. He has been an organizer of conferences, workshops, and special sessions, including MICRO-35 (2002), ISCIS-17 Special Session on Advanced Networking Hardware (2002), the ISCA Memory Wall (2000), ISCA Memory Performance Issues (2001, 2002), and HPCA Workshop on Network Processors (2002, 2003, 2004).

Peter Z. Onufryk received his B.S.E.E. from Rutgers University, M.S.E.E. from Purdue University, and Ph.D. in Electrical and Computer Engineering from Rutgers University. He is currently director of the New Jersey design center at Integrated Device Technology, Inc., where he is responsible for system architecture and validation of processor-based communications products. Before joining IDT, Dr. Onufryk was a researcher for thirteen years at AT&T Labs-Research (formerly AT&T Bell Labs), where he worked on communications systems and parallel computer architectures. These included a number of parallel, cache-coherent, multiprocessor, and data-flow-based machines. Other work there focused on packet telephony and early network/communications processors. Dr. Onufryk is a member of the IEEE, has been a reviewer for numerous professional conferences, and has been an organizer of special sessions and workshops including the HPCA Workshops on Network Processors (2002, 2003 and 2004). He was the architect of several communications processors as well as the architect and designer of numerous other ASICs, boards, and systems.
Contents

About the Editors
Preface

1 Network Processors: New Horizons
Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk
1.1 Architecture
1.2 Tools and Techniques
1.3 Applications
1.4 Conclusions
References

2 Supporting Mixed Real-Time Workloads in Multithreaded Processors with Segmented Instruction Caches
Patrick Crowley
2.1 Instruction Delivery in NP Data Processors
  2.1.1 Fixed-Size Control Store
  2.1.2 Using a Cache as a Fixed-Size Control Store
2.2 Segmented Instruction Cache
  2.2.1 Segment Sizing Strategies
  2.2.2 Implementation
  2.2.3 Address Mapping
  2.2.4 Enforcing Instruction Memory Bandwidth Limits
2.3 Experimental Evaluation
  2.3.1 Benchmark Programs and Methodology
  2.3.2 Segment Sizing
  2.3.3 Sources of Conflict Misses
  2.3.4 Profile-Driven Code Scheduling to Reduce Misses
  2.3.5 Using Set-Associativity to Reduce Misses
  2.3.6 Segment Sharing
2.4 Related Work
2.5 Conclusions and Future Work
References

3 Efficient Packet Classification with Digest Caches
Francis Chang, Wu-chang Feng, Wu-chi Feng, Kang Li
3.1 Related Work
3.2 Our Approach
  3.2.1 The Case for an Approximate Algorithm
  3.2.2 Dimensioning a Digest Cache
  3.2.3 Theoretical Comparison
  3.2.4 A Specific Example of a Digest Cache
  3.2.5 Exact Classification with Digest Caches
3.3 Evaluation
  3.3.1 Reference Cache Implementations
  3.3.2 Results
3.4 Hardware Overhead
  3.4.1 IXP Overhead
  3.4.2 Future Designs
3.5 Conclusions
Acknowledgments
References

4 Towards a Flexible Network Processor Interface for RapidIO, Hypertransport, and PCI-Express
Christian Sauer, Matthias Gries, Kurt Keutzer, Jose Ignacio Gomez
4.1 Interface Fundamentals and Comparison
  4.1.1 Functional Layers
  4.1.2 System Environment
  4.1.3 Common Tasks
4.2 Modeling the Interfaces
  4.2.1 Click for Packet-Based Interfaces
  4.2.2 PCI Express
  4.2.3 RapidIO
  4.2.4 Hypertransport
4.3 Architecture Evaluation
  4.3.1 Micro-Architecture Model
  4.3.2 Simplified Instruction Set with Timing
  4.3.3 Mapping and Implementation Details
  4.3.4 Profiling Procedure
  4.3.5 Results
  4.3.6 Discussion
4.4 Conclusions
Acknowledgments
References

5 A High-Speed, Multithreaded TCP Offload Engine for 10 Gb/s Ethernet
Yatin Hoskote, Sriram Vangal, Vasantha Erraguntla, Nitin Borkar
5.1 Requirements on TCP Offload Solution
5.2 Architecture of TOE Solution
  5.2.1 Architecture Details
  5.2.2 TCP-Aware Hardware Multithreading and Scheduling Logic
5.3 Performance Analysis
5.4 Conclusions
Acknowledgments
References

6 A Hardware Platform for Network Intrusion Detection and Prevention
Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, Ashley Thomas
6.1 Design Rationales and Principles
  6.1.1 Motivation for Hardware-Based NNIDS
  6.1.2 Characterization of NIDS Components
  6.1.3 Hardware Architecture Considerations
6.2 Prototype NNIDS on a Network Interface
  6.2.1 Hardware Platform
  6.2.2 Snort Hardware Implementation
  6.2.3 Network Interface to Host
  6.2.4 Pattern Matching on the FPGA Coprocessor
  6.2.5 Reusable IXP Libraries
6.3 Evaluation and Results
  6.3.1 Functional Verification
  6.3.2 Micro-Benchmarks
  6.3.3 System Benchmarks
6.4 Conclusions
References

7 Packet Processing on a SIMD Stream Processor
Jathin S. Rai, Yu-Kuen Lai, Gregory T. Byrd
7.1 Background: Stream Programs and Architectures
  7.1.1 Stream Programming Model
  7.1.2 Imagine Stream Architecture
7.2 AES Encryption
  7.2.1 Design Methodology and Implementation Details
  7.2.2 Experiments
  7.2.3 AES Performance Summary
7.3 IPv4 Forwarding
  7.3.1 Design Methodology and Implementation Details
  7.3.2 Experiments
  7.3.3 IPv4 Performance Summary
7.4 Related Work
7.5 Conclusions and Future Work
Acknowledgments
References

8 A Programming Environment for Packet-Processing Systems: Design Considerations
Harrick Vin, Jayaram Mudigonda, Jamie Jason, Erik J. Johnson, Roy Ju, Aaron Kunze, Ruiqi Lian
8.1 Problem Domain
  8.1.1 Packet-Processing Applications
  8.1.2 Network Processor and System Architectures
  8.1.3 Solution Requirements
8.2 Shangri-La: A Programming Environment for Packet-Processing Systems
8.3 Design Details and Challenges
  8.3.1 Baker: A Domain-Specific Programming Language
  8.3.2 Profile-Guided, Automated Mapping Compiler
  8.3.3 Runtime System
8.4 Conclusions
References

9 RNOS – A Middleware Platform for Low-Cost Packet-Processing Devices
Jonas Greutert, Lothar Thiele
9.1 Scenario
9.2 Analysis Model of RNOS
  9.2.1 Application Model
  9.2.2 Input Model – SLA, Flows, and Microflows
  9.2.3 Resource Model
  9.2.4 Calculus
9.3 Implementation Model of RNOS
  9.3.1 Path-Threads
  9.3.2 Scheduler
  9.3.3 Implementation
9.4 Measurements and Comparison
9.5 Conclusions and Outlook
Acknowledgments
References

10 On the Feasibility of Using Network Processors for DNA Queries
Herbert Bos, Kaiming Huang
10.1 Architecture
  10.1.1 Scoring and Aligning
  10.1.2 Hardware Configuration
  10.1.3 Software Architecture
  10.1.4 Aho-Corasick
  10.1.5 Nucleotide Encoding
10.2 Implementation Details
10.3 Results
10.4 Related Work
10.5 Conclusions
Acknowledgments
References

11 Pipeline Task Scheduling on Network Processors
Mark A. Franklin, Seema Datar
11.1 The Pipeline Task Assignment Problem
  11.1.1 Notation and Assignment Constraints
  11.1.2 Performance Metrics
  11.1.3 Related Work
11.2 The Greedypipe Algorithm
  11.2.1 Basic Idea
  11.2.2 Overall Algorithm
  11.2.3 Greedypipe Performance
11.3 Pipeline Design with Greedypipe
  11.3.1 Number of Pipeline Stages
  11.3.2 Sharing of Tasks Between Flows
  11.3.3 Task Partitioning
11.4 A Network Processor Problem
  11.4.1 Longest Prefix Matching (LPM)
  11.4.2 AES Encryption – A Pipelined Implementation
  11.4.3 Data Compression – A Pipelined Implementation
  11.4.4 Greedypipe NP Example Design Results
11.5 Conclusions
Acknowledgments
References

12 A Framework for Design Space Exploration of Resource Efficient Network Processing on Multiprocessor SoCs
Matthias Grünewald, Jörg-Christian Niemann, Mario Porrmann, Ulrich Rückert
12.1 Related Work
12.2 Modeling Packet-Processing Systems
  12.2.1 Flow Processing Graph
  12.2.2 SoC Architecture
12.3 Scheduling
  12.3.1 Forwarding Flow Segments Between PEs
  12.3.2 Processing Flow Segments in PEs
  12.3.3 A Scheduling Example
12.4 Mapping the Application to the System
12.5 Estimating the Resource Consumption
12.6 A Design Space Exploration Example
  12.6.1 Application and System Parameters
  12.6.2 Results
12.7 Conclusions
Acknowledgments
References

13 Application Analysis and Resource Mapping for Heterogeneous Network Processor Architectures
Ramaswamy Ramaswamy, Ning Weng, Tilman Wolf
13.1 Related Work
13.2 Application Analysis
  13.2.1 Static vs. Dynamic Analysis
  13.2.2 Annotated Directed Acyclic Graphs
  13.2.3 Application Parallelism and Dependencies
  13.2.4 ADAG Reduction
13.3 ADAG Clustering Using Maximum Local Ratio Cut
  13.3.1 Clustering Problem Statement
  13.3.2 Ratio Cut
  13.3.3 Maximum Local Ratio Cut
  13.3.4 MLRC Complexity
13.4 ADAG Results
  13.4.1 The PacketBench Tool
  13.4.2 Applications
  13.4.3 Basic Block Results
  13.4.4 Clustering Results
  13.4.5 Application ADAGs
  13.4.6 Identification of Coprocessor Functions
13.5 Mapping Application DAGs to NP Architectures
  13.5.1 Problem Statement
  13.5.2 Mapping Algorithm
  13.5.3 Mapping and Scheduling Results
13.6 Conclusions
References

Index
Preface

This volume is the third in a series of texts on network processors. The series is an outgrowth of the annual Workshop on Network Processors and Applications, the third of which (NP-3) was held in conjunction with the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), in Madrid, Spain, on February 14 and 15, 2004. The book begins with a chapter that summarizes current issues and reviews the twelve chapters that make up the remainder of the book. Our goal is to provide a useful text that balances current academic and industrial research and practice. Our target audience includes scientists, engineers, and students interested in network processors, related components, and applications.

Interest in network processor-related research is growing, as illustrated by the robust and sustained workshop attendance and manuscript submission rates, particularly during a difficult economic climate. Therefore, we have decided to organize an international symposium tentatively titled “Symposium on Architectures for Networking and Communications Systems” in Fall 2005.

This book owes a great debt to the many people who made NP-3 possible. The program committee consisted of the four editors of this volume, along with 15 distinguished researchers and practitioners in the fields of networking and computer architecture: Alan Berenbaum (Agere), Brad Calder (UC-San Diego), Andrew Campbell (Columbia University), Jordi Domingo (UPC, Spain), Jorge Garcia (UPC, Spain), Marco Heddes (Transwitch Corporation), Manolis Katevenis (FORTH and University of Crete, Greece), Bill Mangione-Smith (UC-Los Angeles), Kenneth Mackenzie (Reservoir Labs), John Marshall (Cisco Systems), Lothar Thiele (ETH-Zürich, Switzerland), Jonathan Turner (Washington University in St. Louis), Mateo Valero (UPC, Spain), Tilman Wolf (University of Massachusetts), and Raj Yavatkar (Intel).
The workshop program also included a keynote address by Nick McKeown of Stanford University, an invited talk by Marco Heddes of Transwitch Corporation, and an industry panel session moderated by Mark Franklin. The panelists were: Mitch Gusat (IBM, Zürich, Switzerland), Marco Heddes (Transwitch, USA), Jakob Carlstrom (Xelerated, Sweden), Peter Onufryk (IDT, USA), and Raj Yavatkar (Intel, USA). We would like to extend our thanks to the workshop program committee members, the speakers, the panelists, the workshop authors, and HPCA-10 organizers. Without their help and dedication, this book would not exist.
Our special thanks also go to those at Morgan Kaufmann Publishers who once again helped us create this book. These include Denise E.M. Penrose, Publisher; Summer Block, Editorial Assistant; Valerie Witte, Editorial Assistant; and Simon Crump, Publishing Services Manager. Patrick Crowley Mark A. Franklin Haldun Hadimioglu Peter Z. Onufryk
CHAPTER 1
Network Processors: New Horizons

Patrick Crowley, Mark A. Franklin, Washington University in St. Louis
Haldun Hadimioglu, Polytechnic University
Peter Z. Onufryk, Integrated Device Technology, Inc.
The objective of this third volume on network processor (NP) design is the same as that of its predecessors [1, 2]: to survey the latest research and practices in the design, programming, and use of network processors. As in the past, the term network processor is used here in the most generic sense and is meant to encompass any programmable device targeted at networking applications, including: application-specific processors such as those used in programmable search, inspection, security, and traffic management applications; RISC-based devices found in consumer networking equipment, such as DSL and cable gateways; and high-performance microcoded chip multiprocessors such as those found on core router line cards. While the cost, power, and performance requirements of these applications may differ, they share the common thread of programmable packet processing.

The main theme of the first volume was meeting the performance challenges of high-speed networking. The primary goal was to replace fixed-function ASICs with reprogrammable processors, with the promise of allowing system vendors to adapt to evolving network protocols and applications as well as improving time-to-market. Fueled by the dot-com and telecommunications bubbles, a large number of start-ups as well as several large, established companies entered the market. With relatively little theoretical foundation, industry led the way in the development of network processor architectures. The result was a diverse set of architectures that often ignored years of research and experience in the architecture and programming of parallel processors.

The main theme of the second volume shifted from the sole goal of meeting the real-time performance required for high-speed networking to other aspects, such as ease of programming and application development. As equipment vendors evaluated network processors, they quickly realized that a rush to market had caused many network processor vendors to focus on performance, with little thought given to how these devices would be programmed or evaluated. While most vendors announced elaborate development and simulation environments, the fact remained that these devices were often complex chip multiprocessors with highly specialized hardware accelerators that had to be programmed at the microcode level. In addition, while benchmarking methodologies emerged, they were often highly system- and application-dependent and offered limited insight into how a device would actually perform in a vendor’s particular situation.

The themes in this third volume are similar to those of the second volume. The goal of achieving ever-increasing levels of packet-processing performance has given way to other factors such as ease of programming, application development, power, and performance prediction. In fact, many vendors have repositioned their products toward lower-performance, higher-volume markets such as DSL access multiplexers (DSLAMs), cellular base stations, and cable-modem termination systems (CMTS). In addition, new opportunities for programmable packet processors have emerged in areas such as networked storage, TCP offload engines, and security.

Since the market for high-end network processors anticipated during the dot-com and telecommunications bubbles never materialized, the industrial landscape has seen dramatic changes since the writing of the first volume [1]. Faced with little opportunity for revenue and virtually none for acquisition, most start-ups formed during the gold rush years have shut their doors. In addition, a number of established companies have either exited the market entirely or discontinued future development. The remaining vendors have shifted their focus toward evolving current architectures rather than developing new ones. Thus, as has occurred in the past with RISC architectures and parallel processing, the rapid industrial development and subsequent consolidation have opened the door for academic research to lead the way in the formation of a theoretical foundation for the architecture, evaluation, and programming of network processors.

Unlike Volumes 1 and 2, this book has only one part and is entirely focused on the latest research in the design, programming, and use of network processors. Our goal has been to emphasize forward-looking and sometimes theoretical leading-edge research rather than the incremental enhancements of existing architectures. Conceptually, the contributions in this book fall roughly into three domains: architecture; tools and techniques; and applications. The remainder of this introduction reviews this book’s contributions in these areas.
1.1 ARCHITECTURE

As discussed, the rapid pace of industrial innovation and product announcements has slowed. Instead, vendors have focused on evolving and enhancing their existing products. Examples of this include OC-192 announcements by companies such as AMCC [3], EZ Chip [4], and Intel [5]. While economic conditions have slowed the introduction of commercial network processing architectures and ideas, academic researchers have continued to develop new techniques and approaches.

As the deployment of network processors in application areas beyond traditional router line cards has continued, it has been observed that the limited control store found in many network processor architectures presents challenges in more complex applications. In Chapter 2, segmented instruction caches for multithreaded processors are proposed. Such a cache is shared by real-time and non-real-time threads: real-time threads are mapped and preloaded into segments large enough to avoid misses, while non-real-time threads are mapped to segments with normal cache behavior.

Packet classification forms an important component of virtually all packet-processing applications. Chapter 3 presents the challenges associated with packet classification at high line rates. The authors propose a novel cache design, called a digest cache, which trades accuracy for speed. They argue that it is possible to keep misclassifications extremely low while, at the same time, providing performance unobtainable by an exact cache. The digest cache keeps a hash of the flow identifier (portions of the header), rather than the complete flow identifier, significantly reducing storage area and increasing speed.

Many of today’s network processors are based on a chip multiprocessor (CMP) architecture in which packet-processing tasks may be pipelined on multiple processors. Such a capability raises the question of how to assign tasks to the pipeline stages. In Chapter 11, a heuristic approach called GreedyPipe is proposed, which performs near-optimal task-to-stage assignment, even in the presence of multiple flows and multiple-pipeline environments.

The possibility of using a stream architecture for packet processing is explored in Chapter 7. Originally designed for running media applications, the stream architecture exploits single-instruction stream, multiple-data stream (SIMD) parallelism. A modified version of the stream architecture is explored in which packets are streams and complex operations are performed on them in a SIMD fashion.
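The digest-cache idea summarized above for Chapter 3 can be illustrated with a minimal sketch. This is not the chapter's design: the class name `DigestCache`, the direct-mapped layout, the use of SHA-1, and the 32-bit digest width are all assumptions chosen for illustration. The point is only that each slot stores a short digest of the flow identifier instead of the identifier itself, so two different flows can, with small probability, be mistaken for one another.

```python
import hashlib


class DigestCache:
    """Hypothetical direct-mapped cache that stores a short digest of the
    flow identifier instead of the full identifier (illustrative only)."""

    def __init__(self, num_sets=1024, digest_bits=32):
        self.num_sets = num_sets
        self.digest_mask = (1 << digest_bits) - 1
        # Each slot holds (digest, action) or None when empty.
        self.slots = [None] * num_sets

    def _index_and_digest(self, flow_id: bytes):
        # Derive a 64-bit hash; low part selects the slot, the rest is the digest.
        h = int.from_bytes(hashlib.sha1(flow_id).digest()[:8], "big")
        index = h % self.num_sets
        digest = (h // self.num_sets) & self.digest_mask
        return index, digest

    def lookup(self, flow_id: bytes):
        """Return the cached action, or None on a miss.
        A hit may be a false positive if another flow shares index and digest."""
        index, digest = self._index_and_digest(flow_id)
        entry = self.slots[index]
        if entry is not None and entry[0] == digest:
            return entry[1]
        return None

    def insert(self, flow_id: bytes, action):
        """Store the action under the flow's digest, evicting any previous entry."""
        index, digest = self._index_and_digest(flow_id)
        self.slots[index] = (digest, action)
```

A caller would build `flow_id` from the header fields that define a flow (for example, a serialized 5-tuple), call `lookup` first, and fall back to full classification followed by `insert` on a miss. Shrinking `digest_bits` saves storage at the cost of a higher misclassification probability, which is the accuracy-for-speed trade the summary describes.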
1.2 TOOLS AND TECHNIQUES

A challenge in deploying network processors continues to be performance prediction and programming. Network processors, which often consist of a collection of heterogeneous compute and memory resources with varying degrees of flexibility, require a different set of tools and techniques than the ones used in traditional programming models. Added to the traditional software engineering challenges of correctness, flexibility, and productivity are the novel challenges associated with programming parallel systems to meet real-time performance targets. Traditional tools and methods are typically insufficient. This volume includes four chapters that deal with this issue.

Chapter 13 presents a methodology for mapping tasks to RISC cores, coprocessors, and hardware accelerators. An algorithm called maximum local ratio cut (MLRC) clusters instructions according to data and control dependencies. The resulting annotated directed acyclic graph (ADAG) can be used in network processor design to determine matching hardware. The same graph can also be used to map and schedule the graph nodes with a heuristic that uses node criticality as a metric.

In Chapter 12, a modeling and scheduling technique for software and hardware is explored. The authors present a framework that simplifies multiprocessor system-on-chip (SoC) design by estimating resource consumption and offering optimization strategies for delay or energy per packet. Thus, the tool is usable for architectural exploration in mobile and wired applications to compare different designs and identify bottlenecks.

Another comprehensive software development approach, called Shangri-La, is presented in Chapter 8. As shown in Figure 1.1, the environment provides the programmer with a domain-specific language, Baker; a compiler suite that is profile directed; and a run-time system that monitors performance and power consumption to adjust resource allocation to meet the respective targets.

A middleware platform and application analysis methodology is presented in Chapter 9. Real-time calculus is used to obtain an analysis model for an application, such as VoIP, in terms of input scenarios and resource usage. A real-time network operating system (RNOS) then allows direct implementation of the application using the analysis models with guaranteed real-time behavior.
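To make the notion of scheduling a task graph by node criticality (mentioned above for Chapter 13) more concrete, here is a small sketch. The summary does not spell out the chapter's metric, so this assumes one common definition: a node's criticality is the cost of the longest path from that node to any sink of the graph. The function names and the example task names are hypothetical.

```python
from functools import lru_cache


def criticality(costs, succs):
    """Assumed metric: cost of the longest path from each node to a sink.

    costs: dict mapping node -> processing cost (positive number)
    succs: dict mapping node -> list of successor nodes in the DAG
    """
    @lru_cache(maxsize=None)
    def longest(node):
        tail = max((longest(s) for s in succs.get(node, [])), default=0)
        return costs[node] + tail

    return {node: longest(node) for node in costs}


def priority_order(costs, succs):
    """Order nodes from most to least critical.

    With positive costs a node is always more critical than its successors,
    so this order never places a node before one of its predecessors.
    """
    crit = criticality(costs, succs)
    return sorted(costs, key=lambda n: crit[n], reverse=True)
```

A list scheduler would walk `priority_order(...)` and place each node on the earliest available processing resource; that placement step is omitted here because it depends on the hardware model.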
[Figure 1.1: The Shangri-La environment (from Chapter 8). Components shown: Baker, debugger, profiler, pipeline compiler, aggregate compiler, system model, run-time system, and network system hardware.]
1.3 APPLICATIONS

As application areas for network processors continue to expand beyond traditional router line cards, there is a growing trend toward more stateful packet processing, driven by applications such as network security and TCP processing. A continuation of this trend will undoubtedly affect commercial offerings, as most current architectures are optimized for the largely stateless processing found on line cards.

In Chapter 6, the design of a high-speed intrusion detection and prevention system is described. A hardware network node intrusion detection system (NNIDS) is proposed that attaches between the host and the network to prevent not only incoming attacks, but also outgoing attacks. That is, even if the host is compromised, the NNIDS continues to function. The system hardware consists of a network processor and an FPGA and runs the Snort network intrusion software package. The network processor runs pipelined, multithreaded code that performs filtering, IP defragmentation, and TCP reassembly. In parallel, computationally intensive pattern matching is performed by the FPGA.

Faster line rates and increased use of TCP have renewed interest in providing greater support for TCP/IP processing. Chapter 5 presents a topical application area, a hardware implementation of a TCP offload engine (TOE) that can keep pace with future requirements. The device is programmable and thus able to adapt to protocol changes, yet still achieves high processing rates at low power consumption. The main components of the device are a high-speed, multithreaded processing block, a scheduler for thread control, a large segmented transmission control block to keep the TCP connection context, and a DMA controller for payload transfers.

A number of new interconnect standards have emerged for networking and computing. Chapter 4 investigates the use of a network processor to implement three of these emerging standards: RapidIO, Hypertransport, and PCI-Express. The chapter models the three interfaces and explores the idea of implementing them in software on a multithreaded packet-processing engine, such as that found in a network processor. The authors indicate that such an approach could allow a soft implementation of an interconnect interface to execute alongside the main, application-specific processing performed by the network processor. This would eliminate the need for fixed-function interface hardware.

Finally, Chapter 10 investigates the use of network processors for an application that is not in the traditional domain of communications: DNA queries. The authors of the chapter observe that there are similarities between scanning a very large gene database for a specific sequence and inspecting the content of packets for “signatures” in an intrusion-detection problem. In their implementation, the large DNA database is sent in packets over a network connection to a system based on a multithreaded network processor and compared with the query, which is kept on the network processor board (see Figure 1.2). Results show that the parallel processing capability of a network processor with a low clock frequency yields performance comparable to that of an implementation running on a computer clocked at a much higher frequency.

[Figure 1.2: DNA query processing (from Chapter 10). Labeled elements: the network, the receive FIFO, a circular buffer of packets (containing DNA nucleotides) in SDRAM, the processing engines of an IXP1200, an Aho-Corasick trie over the symbols a, c, g, t held in SRAM and scratch memory, and the query (e.g., "ACCTAACCCATTGGA ...").]
1.4 CONCLUSIONS
This volume presents the latest research on network processor architectures, hardware, modeling, software environments, and application development. While considerable progress has been made in this field of study, it is far from mature. New application areas, such as network security and TCP offload, are challenging the basic architecture of current network processors. Continued opportunity exists for enhancing tools, environments, and techniques for programming and performance estimation of these devices. In the few short years that the field has existed, we have already seen the application domain for network processors move far beyond traditional router line-card applications. The rapid convergence of networking and computing is likely to challenge the fundamental existence of stand-alone, general-purpose network processors as we know them today. It is quite likely that concepts and techniques used in today’s network processors will find their way into future general-purpose CPUs, such as those found in workstations and servers. As the convergence in LAN, storage, cluster, and server interconnects continues, it is also likely that the concepts and techniques used in today’s network processors will find their way into deeply embedded application-specific devices such as network adaptors, switches, and peripherals. Finally, networking speeds will continue to grow and raw performance will continue to be an important component of network processor design. Thus, we believe that the opportunities for research and innovative new products in this field are greater today than they have ever been in the past.
REFERENCES
[1] P. Crowley, M. Franklin, H. Hadimioglu, and P. Onufryk, eds., Network Processor Design: Issues and Practices, Volume I, Morgan Kaufmann Pub., San Francisco, CA, 2002.
[2] P. Crowley, M. Franklin, H. Hadimioglu, and P. Onufryk, eds., Network Processor Design: Issues and Practices, Volume II, Morgan Kaufmann Pub., San Francisco, CA, 2003.
[3] Applied Micro Circuits Corporation, nP7510 Network Processor, www.amcc.com.
[4] EZchip Technologies, NP-1c Network Processor, www.ezchip.com/html/in_prod.html.
[5] Intel Corporation, Intel IXP 2800 Network Processor, www.intel.com/design/network/products/npfamily/ixp2800.htm.
CHAPTER 2
Supporting Mixed Real-Time Workloads in Multithreaded Processors with Segmented Instruction Caches
Patrick Crowley, Department of Computer Science and Engineering, Washington University in St. Louis
Program size limitations are a hindrance in current network processors (NPs). NPs, such as the Intel IXP [1] network processor family, consist of two types of on-chip programmable microprocessors: a control processor, which is a standard embedded microprocessor (the ARM-based XScale), and multiple data processors, which are microprocessors designed to efficiently support networking and communications tasks. The control processor is used to handle device management, error handling, and other so-called slow-path functionality. The data processors implement those packet processing tasks that are applied to most packets, that is, the fast-path functionality, and are, therefore, the key to high performance. In current NPs, each data processor executes a program stored in a private on-chip control store that is organized as a RAM. This fixed-size control store limits program size (e.g., the IXP2800 control store has 4 K entries), which consequently both limits the variety of applications implementable on the NP and complicates the programming model. In this chapter, we describe and evaluate a technique that removes this limitation without violating any of the tacit NP requirements that inspired the limitation in the first place. Two questions arise immediately: Do NP data processors need to execute large programs, and, if so, why are instruction caches not used? The answer to the first question can be seen in the large number of application areas in which NPs are being deployed: traditional protocol and network processing systems such as wired and wireless routers, gateways, and switches; nontraditional networking tasks such as web-server load balancing, virus and worm detection and prevention, and denial-of-service prevention systems; and other applications
including cell-phone base stations, video distribution and streaming, and storage systems. This broad deployment is driven by the considerable processing and I/O bandwidths available on NPs, which are organized as heterogeneous chip multiprocessors and are therefore better able to exploit coarse-grain parallelism than are general-purpose processors. On the other hand, further deployment is hindered by program size limitations and the associated increase in software engineering costs. As for the second question, while instruction caches are used in all general-purpose processors and most embedded processors, NPs have at least two characteristics that complicate their use. First, networking systems are typically engineered to handle worst-case conditions [2]. This leads to real-time requirements for programs executing on NPs that are designed to meet worst-case specifications. Caches, on the other hand, are used to speed up average-case performance (and, in the worst case when all references miss, can actually lower performance). Since caches rely on locality, and locality may not be present under worst-case conditions, their inclusion cannot always be justified if a system is designed for worst-case conditions. Second, NP data processors typically feature software-controlled, hardware-supported, coarse-grained multithreading. The presence of multithreading complicates instruction delivery in a processor, particularly when trying to provision a system for a certain performance target. In this chapter, we propose the use of a segmented instruction cache along with profile-driven code scheduling to provide flexible instruction delivery to both real-time and nonreal-time (i.e., best-effort) programs, even when those programs are executing on the same multithreaded processor.
Our proposed segmented instruction cache partitions the instruction cache (via a thread-specific address mapping function) such that each thread is allocated a fixed number of blocks in the cache and threads cannot conflict with one another. In this way, real-time programs can be tailored to fit into their assigned slots completely to avoid misses, while best-effort code, which can tolerate misses, can map larger, more complex control flow structures into its own segment without impairing the instruction delivery of other threads. To improve the performance of best-effort code, profile-driven code scheduling is used to reduce conflict misses. Profiling is also used to explore a variety of program-specific segment sizing strategies. The proposed design is evaluated on a selection of programs running on both direct-mapped and set-associative segmented instruction caches.
The remainder of this chapter is organized as follows. Section 2.1 describes how programs are loaded and executed in NP data processors, and describes how a standard instruction cache might be used. Section 2.2 introduces the segmented instruction cache. An experimental evaluation of the proposal is presented in Section 2.3 and considers several benchmark programs, segment sizing, and reducing miss rates with profile-driven code scheduling and set associativity. Section 2.4 presents related work; the chapter ends with conclusions and future work in Section 2.5.
2.1 INSTRUCTION DELIVERY IN NP DATA PROCESSORS
Network processors, like most embedded processors and unlike most general-purpose processors, have rich compile-time knowledge of which programs and threads will be active at runtime. In fact, on the Intel IXP the source code for all the programs in all the threads to be run on a data processor is compiled together to form a single binary instruction image, which is loaded by the control processor into the data processor’s control store when the system boots. In the following subsections, we discuss how the program is mapped into the control store, and how a cache might be used in place of a RAM.
2.1.1 Fixed-Size Control Store
Fixed-size control stores are essentially random access memories (RAMs) containing instructions. Figure 2.1 illustrates an example with three programs, totaling 21 instructions in length, and a RAM with 18 memory locations. In the example, only the first two programs can fit; the third must be mapped to another data processor with enough available control store entries. Note that this information is known at compile time: when the programmer compiles her program, it will be known that all three programs cannot fit. Since the control store is fixed, the total size of threads allocated to a single data processor has a hard upper bound. One consequence is that some programs are simply too big to fit in a single control store. In this case, the programmer must reduce the program size via optimization (if it is possible and it represents an acceptable performance trade-off) or decompose the program into fragments to be mapped to multiple processors. Decomposing a task into balanced steps and mapping those steps to multiple processors is a form of software pipelining; this can be, and indeed is, done as a performance enhancement even when the control store size is not a consideration. However, while certain algorithms and programs lend themselves to balanced pipeline implementations, the efficient pipelining of general-purpose code is difficult. Thus, a method for flexible instruction delivery can alleviate the need to fragment a program when it is inconvenient or unnecessary to do so.
FIGURE 2.1: Illustrations mapping three programs from three distinct threads into (a) a fixed-size RAM, (b) a cache, and (c) a segmented cache, all of which contain 16 entries. The fixed-size RAM cannot accommodate the third program. The cache can, but the mapping causes conflicts between the third and first programs. The segmented cache restricts conflicts to the segment assigned to program three. [Diagram not reproduced.]
2.1.2 Using a Cache as a Fixed-Size Control Store
As previously mentioned, a cache can alleviate program size restrictions. Furthermore, a cache can be used as a fixed-size control store provided that the aggregate program size does not exceed the cache capacity and that code is laid out contiguously at the start of the address space. Given a cache with N entries, or sets, and an address a, then a mod N yields the entry address to which a maps. Thus, if the total number of instructions does not exceed N, then each will map to its own set. Given this scenario, the cache can be accessed without incurring misses, as follows. First, it is necessary to preload the cache with instructions as is done in the fixed control store case. Now, consider the three possible sources of uniprocessor cache misses [3]: compulsory, capacity, and conflict. Given that the program fits in the cache, preloading avoids compulsory misses, there will be no capacity misses, and conflict misses can be avoided by laying out the code sequentially in memory without any gaps. In this way, a cache can be used to implement a fixed-size RAM. Note that no change is required in the programming model or compiler structure. However, a cache does involve some overhead in both access time and size; we return to these topics in Section 2.2.2.
Now, suppose one of our threads is not involved in providing stable service under worst-case conditions. In this case, our program can tolerate some instruction cache misses during execution. However, as illustrated in the second part of Figure 2.1, if the program size exceeds the contiguous unused portion of the cache, it will conflict with another thread (which may not be able to tolerate misses). One way around this is to use cache line pinning to keep high-priority code from being evicted. When a cache line is pinned, it will not be replaced on a miss. This works well to keep important code in place, but always causes misses in the lower-priority conflicting code. There are other approaches to consider, and we discuss these in Section 2.4. As an alternative that does not require these unavoidable conflicts, we propose the segmented instruction cache.
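As a small illustration of the a mod N placement rule discussed in this section, the following Python sketch checks which instruction addresses collide in a direct-mapped cache with one instruction per entry. The function names and the 16-entry size are assumptions chosen for the example; they are not taken from the chapter's tools.

```python
def set_index(address, num_sets):
    """Direct-mapped placement: address a maps to entry a mod N."""
    return address % num_sets


def conflicting_addresses(addresses, num_sets):
    """Return addresses that map to an entry already claimed by an earlier address."""
    owner = {}
    conflicts = []
    for a in addresses:
        s = set_index(a, num_sets)
        if s in owner:
            conflicts.append(a)
        else:
            owner[s] = a
    return conflicts


NUM_SETS = 16                      # hypothetical cache size
fits = list(range(12))             # 12 contiguous instructions: each gets its own entry
too_big = list(range(21))          # 21 contiguous instructions: the tail wraps around

print(conflicting_addresses(fits, NUM_SETS))
print(conflicting_addresses(too_big, NUM_SETS))
```

When the contiguous image fits, no address collides; once the image exceeds the number of entries, the addresses past the end wrap onto entries already in use, which is the situation cache line pinning or segmentation has to manage.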
2.2 SEGMENTED INSTRUCTION CACHE
In a segmented instruction cache, each thread is assigned a segment, and this provides two key benefits beyond the unrestricted program sizes afforded by caches. First, real-time programs that cannot tolerate misses can be allocated segments equal to their program size; this is equivalent to the fixed control store case. The second benefit is that threads are insulated from one another and cannot conflict; most notably, real-time segments cannot be disturbed by other threads. This situation is illustrated in the third part of Figure 2.1. To implement segments, we modify the cache mapping function to include a thread-specific segment size and offset. In other words, rather than having a constant N entries, the cache is seen to have N(t) entries by thread t, and offset(t) indicates the position of the first set in the segment. Thus, the address mapping function becomes:

    (a mod N(t)) + offset(t),    (2.1)

where a is the address and N(t) and offset(t) are thread t’s segment size and offset within the cache, respectively. Note that there is no change in programming model for a real-time program. The program gets compiled; the required segment size is equal to the program size and thus is known. The program must be mapped to a data processor with a sufficiently large segment available. In fact, one would probably map all real-time programs on a given data processor to the same segment, and assign unique segments only to nonreal-time code (since, as we will see, interthread conflicts can be problematic). The data processor can now support arbitrarily large nonreal-time programs. For such programs, however, we must determine a segment size.
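The mapping of Equation 2.1 can be sketched in a few lines of Python. The `Segment` record, the thread names, and the segment sizes below are hypothetical values chosen for illustration; only the formula itself comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    size: int    # N(t): number of sets visible to thread t
    offset: int  # offset(t): first set of the thread's segment


def segmented_set_index(address, segment):
    """Equation 2.1: (a mod N(t)) + offset(t)."""
    return (address % segment.size) + segment.offset


# A hypothetical 16-set cache split between a real-time and a best-effort thread.
segments = {
    "real_time": Segment(size=6, offset=0),
    "best_effort": Segment(size=10, offset=6),
}

# The real-time program has exactly 6 instructions; the best-effort one has 40.
rt_sets = {segmented_set_index(a, segments["real_time"]) for a in range(6)}
be_sets = {segmented_set_index(a, segments["best_effort"]) for a in range(40)}

print(sorted(rt_sets))
print(sorted(be_sets))
```

However large the best-effort program grows, its references stay inside its own range of sets, so they cannot evict the real-time thread's instructions.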
2.2.1 Segment Sizing Strategies
When choosing a segment size, we first determine if a choice is available. It may be the case that other considerations dictate that a given nonreal-time program must run in a particular amount of available space (e.g., whatever remains once the real-time code has been allocated). In this case there is no decision. On the other hand, there may be some choice in the size of the allocation; in some fashion, a number of blocks needs to be determined. For example, there may be multiple nonreal-time programs to be mapped to a particular data processor and appropriate segment sizes must be chosen. Accordingly, consider the following three segment sizing strategies.
✦ Program. Choose a segment size equal to the number of instructions in the program. No possible execution will result in a miss. This is the strategy used for real-time code.
✦ Profile. Choose a segment size equal to the number of unique references seen during a profiled execution run. In this case, misses will occur only when program execution takes place outside the profiled paths.
✦ Locality. Choose a segment size equal to some fraction of the unique references seen during profiling. It is frequently the case that a small fraction of unique references account for a large majority of the dynamic references. In this case, misses are possible but will be few if the profiled data is representative of real executions. In this chapter, we will consider fractions representing 20, 40, 60, and 80 percent of the total unique references seen during profile runs.
The profile and locality strategies use execution profiling to determine what instructions are likely to be executed. Note that this is an incomplete methodology for real-time code, but is sufficient for best-effort code in which we seek to improve the expected case. As will be seen, we can also use profile information to intelligently lay out code to reduce conflict misses.
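The three strategies can be summarized with a short Python sketch. The function name and the synthetic trace are hypothetical, and rounding fractional sizes up is an assumption; it was chosen because it reproduces the binsearch sizes reported later in Table 2.2 (79 unique references give 16, 32, 48, and 64).

```python
def segment_sizes(static_count, trace, fractions=(20, 40, 60, 80)):
    """Segment sizes for the program, profile, and locality strategies.

    static_count: number of instructions in the program.
    trace: instruction addresses observed during a profiled run.
    fractions: locality percentages of the unique profiled references.
    """
    unique = len(set(trace))
    sizes = {"program": static_count, "profile": unique}
    for pct in fractions:
        # Integer ceiling of unique * pct / 100 (assumed rounding rule).
        sizes[f"locality {pct}"] = (unique * pct + 99) // 100
    return sizes


# Synthetic stand-in for a profile: 79 distinct addresses, each executed twice.
trace = list(range(79)) * 2
sizes = segment_sizes(static_count=199, trace=trace)
print(sizes)
```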
2.2.2 Implementation
We now consider the implementation details of a segmented cache. Figure 2.2 shows the organization of a typical direct-mapped cache augmented with a unit, surrounded by a dashed rectangle, that performs segmentation. Certain bits are extracted from the address and are used to index into the tag and data arrays. Tags are compared to make sure the data entry corresponds to the desired address rather than some other address that maps to the same set. In this implementation, a portion of the index is stored along with the tag to allow a range of segment sizes no smaller than 128 entries. Fewer bits would be needed to support a larger minimum segment size; similarly, more index bits would need to be stored to support a smaller minimum segment size. As can be seen, only a transformation of the index bits is required to implement segmentation. Segment sizes and offsets can be stored in each thread’s status registers. At context switch time, the segment size and offset of the incoming thread can be loaded into the cache mapping unit.
FIGURE 2.2: A sample cache organization with 32-bit addresses, 2K entries, 8-byte blocks, and a minimum segment size of 128. The segmented cache implementation requires a transformation only on the index bits. [Diagram not reproduced; it shows the block address split into tag and index fields, the block offset, the segment-and-offset unit, the valid bit, the 2048-entry tag and data arrays, and the tag comparator.]
As mentioned earlier, a cache of any kind incurs both access time and size overheads compared to a RAM. In our case, the access time overhead is due to both the index bit transformation and tag comparison. Whether this increased access time is on the critical path and reduces clock rate is a design-specific consideration. If it were (and we would expect it to be on a simple data processor), an extra processor pipeline stage could be added in order to keep from hurting the clock rate (the IXP1200 data processor, for example, uses a five-stage pipeline and this scheme might require a sixth stage). The size overhead is due to the tag array; each entry requires some number of tag bits not needed in a RAM. For the example shown in Figure 2.2, with a minimum segment size of 128 entries, each 64-bit data entry requires a valid bit and a 22-bit tag entry, which results in a bit count overhead of approximately 36 percent as compared to a RAM, which incurs no overhead.
Whether these overheads are worthwhile is a design-specific decision that depends on a wide range of factors. Given the dominance of caches in general-purpose and embedded systems, we expect that future NP data processors will pay these overheads in order to gain the benefits of unrestricted program size.
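The 36 percent figure can be checked with a few lines of arithmetic. The decomposition of the 22-bit stored tag into a conventional tag plus the index bits above the minimum segment size is our reading of the description, not something the text states explicitly, so the sketch below should be taken as an assumption that happens to reproduce the quoted numbers.

```python
ADDRESS_BITS = 32
NUM_SETS = 2048        # 2K entries
BLOCK_BYTES = 8        # 8-byte blocks, i.e., 64 data bits per entry
MIN_SEGMENT = 128      # smallest supported segment size

index_bits = NUM_SETS.bit_length() - 1            # log2(2048) = 11
offset_bits = BLOCK_BYTES.bit_length() - 1        # log2(8) = 3
base_tag_bits = ADDRESS_BITS - index_bits - offset_bits      # conventional tag

# Assumption: index bits that a minimum-size segment does not use for indexing
# must be kept with the tag so that lookups remain unambiguous.
extra_index_bits = index_bits - (MIN_SEGMENT.bit_length() - 1)
stored_tag_bits = base_tag_bits + extra_index_bits

data_bits = BLOCK_BYTES * 8
overhead = (stored_tag_bits + 1) / data_bits      # +1 for the valid bit

print(base_tag_bits, extra_index_bits, stored_tag_bits)
print(f"{overhead:.1%}")
```

Under these assumptions the stored tag is 18 + 4 = 22 bits, and 23 extra bits per 64-bit entry is about 36 percent.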
2.2.3 Address Mapping
If all segment sizes were a power of 2, then the mapping function would be trivial: extract the index bits and use those to index into the tag and data arrays (as is done with normal caches, all of which are sized to be a power of 2); this is a highly efficient implementation of the modulo operator. For arbitrary segment sizes, however, another approach must be taken since the modulo operation requires an integer division, rather than a simple bit selection, when the modulus is not a power of 2. To illustrate the problem, first consider a segment with eight entries. The mod operation can be implemented by simply using the three least significant bits of an address as an index to select one of the 2^3 = 8 sets. Now consider a segment with seven entries. Three bits are still needed to encode the seven entries, but one of the 3-bit combinations will not correspond to a valid entry (i.e., there are only seven entries but eight possible 3-bit sequences). In fact, the invalid entry will map to an entry in another segment, a situation we certainly want to avoid. One solution implements digital division and calculates the remainder directly. For our purposes, however, a simpler approach is preferable. The following pseudo-code implements the calculation, where N is the segment size and the selected bits are the bits taken from the address that encode the entry (the number of bits needed is equal to the number of bits needed to encode the segment size, i.e., ⌈log2 N⌉, and can be different for each thread).

    if selected-bits >= N
        set = selected-bits - N + offset
    else
        set = selected-bits + offset

The idea is to detect an invalid bit sequence (i.e., one that is greater than or equal to the segment size) and generate a valid bit sequence from it. The computation requires adders and a comparator, but is parallel and should permit fast implementation.
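The pseudo-code above can be made executable as follows. This is a minimal sketch under the assumption that the selected bits are the ⌈log2 N⌉ least significant bits of the instruction address; the function names are hypothetical.

```python
def bits_needed(n):
    """Number of bits needed to encode entries 0 .. n-1, i.e., ceil(log2 n)."""
    return (n - 1).bit_length()


def map_to_set(address, n, offset):
    """Map an address into a segment of n sets starting at set `offset`."""
    k = bits_needed(n)
    selected = address & ((1 << k) - 1)   # low-order k bits of the address
    if selected >= n:
        # Invalid bit pattern: fold it back into the segment.
        return selected - n + offset
    return selected + offset


# A seven-entry segment placed at sets 8..14 never spills into a neighbor.
targets = {map_to_set(a, n=7, offset=8) for a in range(64)}
print(sorted(targets))

# For a power-of-2 segment the result equals the ordinary modulo mapping.
print(all(map_to_set(a, 8, 0) == a % 8 for a in range(64)))
```

Because 2^k is less than 2N, a single subtraction is always enough to bring an invalid pattern back into range. Note that for sizes that are not a power of 2 the result is not always equal to a mod N, and the low-numbered sets of the segment receive the folded patterns in addition to their own.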
2.2.4 Enforcing Instruction Memory Bandwidth Limits
Since the path to memory where the rest of the program is stored (i.e., from which misses are serviced) is likely to be a resource shared by real-time activities (e.g., data or I/O accesses), it might be necessary to limit interference from instruction cache misses. There are two considerations. The first is to keep the required instruction memory bandwidth at a level sufficient to meet any real-time constraints. The second is to keep the required instruction memory bandwidth deterministic so that adequate provisioning of system resources can be performed. To both of these ends, one could employ an instruction fetch throttle that determines the bandwidth allocated to instruction fetches. In this way, the instruction memory request rate can be set by the system and thereby bounded for the purpose of system provisioning. One policy, for example, would be to give instruction access priority over all other memory transactions. The actual amount of instruction memory bandwidth allocated will be a system-level question, while the bandwidth required will be a consequence of program size, execution pattern, thread scheduling pattern, and region size. We plan to explore mechanisms to support such policies in future work.
2.3 EXPERIMENTAL EVALUATION
In the following sections, we evaluate the segmented instruction cache and demonstrate effective sizing strategies and miss-reduction techniques.
2.3.1 Benchmark Programs and Methodology
Our performance simulations are trace-driven. We use several programs that have been made publicly available by researchers [4, 5]; these are described in Table 2.1. Most of the programs are typical of networking codes: searching, sorting or extracting values, data validation, and encryption. The rest are numeric in nature (e.g., FFT and ludcmp). All are C-based programs and were compiled on a Sun Solaris 8 machine with GCC version 2.95.1. The instruction traces were gathered by running GNU gdb version 4.15.1. To produce the experimental cache results reported here, we used our own cache simulator, which has been validated against the DineroIV [6] cache simulator. The simulator is implemented in Python [7], is easily extensible, and can flexibly model a wide variety of memory system features and organizations.
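To make the trace-driven methodology concrete, here is a minimal Python sketch of a direct-mapped instruction cache simulation with one instruction per line and an arbitrary (not necessarily power-of-2) number of sets. It is not the authors' simulator; the function name, the optional preload argument, and the synthetic loop trace are assumptions made for illustration.

```python
def simulate_direct_mapped(trace, num_sets, preload=()):
    """Return the number of misses for a trace of instruction addresses.

    trace: sequence of instruction addresses in execution order.
    num_sets: number of cache entries (any positive integer).
    preload: addresses placed in the cache before execution starts.
    """
    sets = [None] * num_sets
    for a in preload:
        sets[a % num_sets] = a
    misses = 0
    for a in trace:
        s = a % num_sets
        if sets[s] != a:      # empty entry or a different address: miss
            misses += 1
            sets[s] = a
    return misses


loop = list(range(7)) * 10    # a 7-instruction loop executed 10 times

print(simulate_direct_mapped(loop, num_sets=7))                    # loop fits
print(simulate_direct_mapped(loop, num_sets=5))                    # loop does not fit
print(simulate_direct_mapped(loop, num_sets=7, preload=range(7)))  # preloaded
```

With seven sets only the first pass through the loop misses; with five sets the instructions that share a set keep evicting one another; with preloading and a segment as large as the loop there are no misses at all.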
TABLE 2.1: Sample programs discussed in this study.

Program    Static I.C.  Dyn. I.C.  Unique Dyn. I.C.  Description
binsearch  199          175        79                Binary search over 15 integers.
chk_data   99           295        66                Example from Park’s [8] thesis that finds the first nonzero entry in an array.
CRC        442          71633      258               Cyclic redundancy check.
DES        1058         110217     929               Data encryption standard [9].
FFT        835          3915       305               Fast Fourier transform.
fibcall    168          557        48                Sum the first 60 Fibonacci numbers.
isort      209          2215       89                Insertion sort of 10 integers.
ludcmp     761          8851       631               LU decomposition of linear equations.
matmul     274          8707       158               Matrix multiplication.
qsort      574          2649       426               Quicksort.
qurt       492          1714       365               Finding roots of quadratic equations.
select     520          2959       367               Select k largest integers from a list.
tstdemo    2558         1.67M      2010              Using ternary search trees [4] to find all words in a 20K-word dictionary that are within a Hamming distance 3 of “elephant.”

A custom simulator was necessary since our cache explorations involve novel (e.g., nonuniform cache mapping) and atypical (e.g., cache set counts that are not a power of 2) features. In all experiments, a line size of one word (32 bits, i.e., one instruction) is used.
2.3.2 Segment Sizing
In this section, we investigate the effectiveness of segment-sizing strategies. Recall that the program sizing strategy will yield no cache misses at all, as
will profile provided that the profiling inputs are correct; the former will be used for programs that cannot tolerate misses, and the latter can be used for programs that should not miss but can afford to do so occasionally. The locality strategy will, by definition, incur some misses. It is the most optimistic and anticipates a high amount of locality in the reference stream. Thus, in our cache experiments, only the locality strategy will be examined. Table 2.2 reports program-specific segment sizes for each of these strategies.

TABLE 2.2: Sample segment sizes.

                              Locality
Program    Program  Profile   20    40    60    80    95
binsearch  199      79        16    32    48    64    76
chk_data   99       66        14    27    40    53    63
CRC        442      258       52    104   155   207   246
DES        1058     929       186   372   558   744   883
FFT        835      305       61    122   183   244   290
fibcall    168      48        10    20    29    39    46
isort      209      89        18    36    54    72    85
ludcmp     761      631       127   253   379   505   600
matmul     274      158       32    64    95    127   151
qsort      574      426       86    171   256   341   405
qurt       492      365       73    146   219   292   347
select     520      367       74    147   221   294   349
tstdemo    2558     2010      402   804   1206  1608  1910

While gathering good profiles is an important task in practical software engineering, it is outside our interest in this study. We use the profile inputs provided with each benchmark for our evaluations. Note that our emphasis is not on measuring the accuracy of the profiles, but rather on determining the performance of the segmented instruction cache given good profile information.
To gain initial intuition about how much locality is present, we first consider the distribution of instruction references in our profile runs. Figure 2.3 depicts the distribution of dynamic instruction references over static instructions. For example, the bottom-most stacked segment in the binsearch column indicates that 10 percent of the unique static instructions seen during execution account for 20 percent of the dynamic references. The most extreme example is tstdemo, in which 95 percent of dynamic instructions are caused by fewer than 5 percent of the static instructions. On average, however, it appears that around 60 percent of the static instructions account for 95 percent or more of the dynamic references. This suggests that if we can cache those 60 percent and keep them from conflicting with one another, the remaining 40 percent cannot impose too many misses if they are not kept in the cache.
FIGURE 2.3: Cumulative distribution of dynamic over static instructions. This graph reports the percentage of static instructions seen during execution that are responsible for 20%, 40%, 60%, 80%, and 95% of the dynamic instructions observed in a profile run. [Stacked bar chart not reproduced; the vertical axis is the percentage of static instructions (0 to 100), with one bar per benchmark program.]
Table 2.3 reports miss rates for the locality segment-sizing strategy, both before and after preloading, and categorizes the misses for the preloaded miss rate. There are several points to be made:
✦ Miss rate improves with increasing segment sizes.
✦ Miss rate improves, as expected, with preloading.
✦ Miss rates (even with preloading) are quite high (e.g., 14 percent or higher with locality 60) for five of the programs: binsearch, FFT, qsort, qurt, and select. Around 25 percent of misses for those programs are due to conflicts.
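The kind of cumulative distribution plotted in Figure 2.3 can be computed from an instruction trace as sketched below. The function name and the synthetic trace are hypothetical; the sketch simply sorts the executed static instructions by how often they occur and counts how many are needed to reach a given share of the dynamic references.

```python
from collections import Counter


def static_fraction_for_coverage(trace, coverage_pct):
    """Percentage of executed static instructions needed to account for
    coverage_pct percent of the dynamic references in the trace."""
    counts = sorted(Counter(trace).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for used, count in enumerate(counts, start=1):
        covered += count
        if covered * 100 >= coverage_pct * total:
            return 100.0 * used / len(counts)
    return 100.0


# Synthetic trace: one very hot instruction, one warm one, five cold ones.
trace = [0] * 90 + [1] * 5 + [2, 3, 4, 5, 6]

for pct in (20, 40, 60, 80, 95):
    print(pct, round(static_fraction_for_coverage(trace, pct), 1))
```

A program whose curve rises quickly, like the synthetic trace here, is a good candidate for a small locality-sized segment.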
TABLE 2.3: Cache results for sample segment sizes. The compulsory, capacity, and conflict columns give the share of misses of each type with preloading.

                                With preloading
Program    Locality  Miss rate  Miss rate  Compulsory  Capacity  Conflict
binsearch  20        0.98       0.89       0.40        0.60      0.00
binsearch  40        0.84       0.66       0.41        0.59      0.00
binsearch  60        0.50       0.23       0.78        0.00      0.23
binsearch  80        0.45       0.09       1.00        0.00      0.00
chk_data   20        0.94       0.89       0.20        0.80      0.00
chk_data   40        0.42       0.33       0.40        0.00      0.60
chk_data   60        0.22       0.09       1.00        0.00      0.00
chk_data   80        0.22       0.04       1.00        0.00      0.00
CRC        20        0.39       0.39       0.01        0.93      0.06
CRC        40        0.17       0.17       0.01        0.00      0.98
CRC        60        0.00       0.00       0.69        0.11      0.19
CRC        80        0.00       0.00       0.76        0.12      0.12
DES        20        0.16       0.16       0.04        0.34      0.62
DES        40        0.03       0.03       0.20        0.01      0.78
DES        60        0.01       0.00       0.89        0.00      0.11
DES        80        0.01       0.00       0.84        0.00      0.16
FFT        20        0.74       0.73       0.09        0.91      0.00
FFT        40        0.64       0.61       0.08        0.40      0.52
FFT        60        0.19       0.14       0.22        0.50      0.28
FFT        80        0.10       0.03       0.46        0.00      0.54
fibcall    20        0.90       0.88       0.08        0.92      0.00
fibcall    40        0.09       0.05       1.00        0.00      0.00
fibcall    60        0.09       0.03       1.00        0.00      0.00
fibcall    80        0.09       0.02       1.00        0.00      0.00
isort      20        0.93       0.93       0.03        0.97      0.00
isort      40        0.48       0.47       0.05        0.91      0.04
isort      60        0.07       0.05       0.34        0.66      0.00
isort      80        0.04       0.01       1.00        0.00      0.00
ludcmp     20        0.19       0.17       0.33        0.67      0.00
ludcmp     40        0.09       0.06       0.74        0.24      0.02
ludcmp     60        0.09       0.05       0.60        0.00      0.40
ludcmp     80        0.07       0.01       1.00        0.00      0.00
matmul     20        0.88       0.88       0.02        0.98      0.00
matmul     40        0.19       0.18       0.06        0.41      0.52
matmul     60        0.13       0.12       0.06        0.00      0.94
matmul     80        0.13       0.12       0.03        0.00      0.97
qsort      20        0.53       0.50       0.26        0.73      0.01
qsort      40        0.48       0.42       0.23        0.76      0.01
qsort      60        0.42       0.33       0.20        0.80      0.00
qsort      80        0.30       0.17       0.18        0.37      0.44
qurt       20        0.60       0.56       0.31        0.30      0.40
qurt       40        0.33       0.25       0.52        0.48      0.00
qurt       60        0.31       0.18       0.48        0.35      0.18
qurt       80        0.23       0.06       0.68        0.02      0.30
select     20        0.89       0.86       0.12        0.88      0.00
select     40        0.65       0.60       0.12        0.88      0.00
select     60        0.30       0.23       0.21        0.12      0.67
select     80        0.20       0.10       0.25        0.00      0.75
tstdemo    20        0.03       0.03       0.03        0.05      0.91
tstdemo    40        0.03       0.03       0.03        0.01      0.97
tstdemo    60        0.00       0.00       0.30        0.03      0.67
tstdemo    80        0.00       0.00       0.25        0.00      0.75
Recall that of the three types of uniprocessor cache misses, conflicts are the only type we can manipulate further for a given segment size (some compulsory misses are avoided via preloading, and capacity misses are determined by segment size). Since they can be manipulated, we discuss the sources of conflicts and techniques for reducing them in the following sections.
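The three miss categories can be separated in simulation with the conventional "three Cs" accounting, sketched below in Python. We assume, without confirmation from the text, that the chapter's classification follows this convention: a miss is compulsory on the first reference to an address, a capacity miss if a fully associative LRU cache of the same size would also miss, and a conflict miss otherwise. The function name is hypothetical and preloading is omitted for brevity.

```python
from collections import OrderedDict


def classify_misses(trace, num_sets):
    """Count compulsory, capacity, and conflict misses of a direct-mapped
    cache with one instruction per line."""
    direct = [None] * num_sets
    lru = OrderedDict()      # fully associative LRU cache with num_sets lines
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for a in trace:
        # Reference the fully associative LRU cache used as the yardstick.
        fa_hit = a in lru
        if fa_hit:
            lru.move_to_end(a)
        else:
            if len(lru) >= num_sets:
                lru.popitem(last=False)   # evict the least recently used line
            lru[a] = True
        # Reference the direct-mapped cache being evaluated.
        s = a % num_sets
        if direct[s] != a:
            direct[s] = a
            if a not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(a)
    return counts


print(classify_misses([0, 4, 0, 4], num_sets=4))        # 0 and 4 share set 0
print(classify_misses([0, 1, 2, 3, 4, 0], num_sets=4))  # working set too large
```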
2.3.3 Sources of Conflict Misses
Cache conflict misses can result when two or more addresses map to the same set; this can be called a spatial conflict. Spatial conflicts are a necessary but insufficient condition for conflict misses: there must also be a temporal conflict between the addresses (i.e., the references must be interleaved with one another over time). If the addresses are not referenced near to one another in time, then few, if any, conflict misses will result. In a multithreaded processor, there are two possible sources of conflict misses: intrathread and interthread. Intrathread conflicts arise when addresses within one thread conflict. Interthread conflicts are possible when different threads share a segment and their address request patterns conflict; we consider segment sharing like this in Section 2.3.6. If we know in advance which paths are most likely to be executed, then it is possible for us to schedule our program into memory in such a way as to minimize spatial conflicts between instructions on the frequently executed paths; we consider this topic in the following section. There are also well-known hardware techniques for reducing conflict misses, and we consider these in Section 2.3.5.
2.3.4 Profile-Driven Code Scheduling to Reduce Misses
As can be seen in Figure 2.3, a fraction of static instructions are responsible for a majority of the dynamic references. By using profile information, we can identify the instructions that are executed most frequently and position them in memory so that they conflict only with infrequently executed (or, ideally, nonexistent) instructions. This general technique is called code scheduling or code reordering and is well studied [10]. Code scheduling is a particularly appropriate technique for NP programs because so much is known and statically fixed at compile time. Using this approach, we can schedule the code in our benchmark programs. The resulting effect on miss rate is shown in Table 2.4. There are, again, several points to be made:
✦ Miss rates always decrease as segment sizes increase.
✦ All locality 80 preloaded miss rates are at or below 10 percent.
✦ Conflict misses rarely dominate the other types, and only do so at very low miss rates. It appears that most conflicts have been avoided.
Figure 2.4 directly compares the miss rates for the locality 60 segment-sizing strategy, the only strategy we evaluate in the remainder of the chapter. All miss rates for that strategy are now below 26 percent with most being below 10 percent.
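The general idea of profile-driven placement can be sketched as follows. This is a deliberately simplified illustration and not the positioning algorithm of [10] or the scheduler used for these experiments: it assumes equally sized code blocks, a direct-mapped segment indexed by block address modulo the number of sets, and a hypothetical profile of execution counts.

```python
def schedule_by_frequency(profile):
    """Lay blocks out contiguously, hottest first.

    profile maps a block name to its profiled execution count; the result
    maps each block name to its new block address.
    """
    ordered = sorted(profile, key=lambda name: profile[name], reverse=True)
    return {name: address for address, name in enumerate(ordered)}


def set_index(address, num_sets):
    """Set selected by a block address in a direct-mapped segment."""
    return address % num_sets


# Hypothetical profile: three hot blocks and three cold ones.
profile = {"init": 1, "loop_body": 900, "error": 0,
           "lookup": 500, "cleanup": 1, "update": 400}
NUM_SETS = 4

layout = schedule_by_frequency(profile)
for name, address in layout.items():
    print(name, "-> block", address, "set", set_index(address, NUM_SETS))
```

Because the hottest blocks occupy consecutive addresses, up to NUM_SETS of them land in distinct sets; the colder blocks wrap around and share sets with them, so the frequently executed code conflicts only with rarely executed code.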
Table 2.4. Cache results after code scheduling.

Program    Locality  Miss rate  Miss rate with preloading  Compulsory  Capacity  Conflict
binsearch  20        0.99       0.90                       0.40        0.60      0.00
           40        0.57       0.38                       0.70        0.30      0.00
           60        0.48       0.21                       0.86        0.00      0.14
           80        0.47       0.10                       0.83        0.00      0.17
chk_data   20        0.91       0.86                       0.20        0.80      0.00
           40        0.22       0.13                       1.00        0.00      0.00
           60        0.22       0.09                       1.00        0.00      0.00
           80        0.23       0.05                       0.93        0.00      0.07
CRC        20        0.37       0.37                       0.01        0.99      0.00
           40        0.00       0.00                       0.80        0.19      0.01
           60        0.00       0.00                       0.87        0.13      0.00
           80        0.00       0.00                       1.00        0.00      0.00
DES        20        0.16       0.16                       0.04        0.37      0.59
           40        0.05       0.05                       0.10        0.01      0.89
           60        0.04       0.03                       0.10        0.00      0.90
           80        0.01       0.01                       0.31        0.00      0.69
FFT        20        0.56       0.55                       0.11        0.88      0.00
           40        0.34       0.31                       0.15        0.82      0.03
           60        0.14       0.09                       0.34        0.25      0.40
           80        0.09       0.03                       0.56        0.00      0.44
fibcall    20        0.90       0.88                       0.08        0.92      0.00
           40        0.09       0.05                       1.00        0.00      0.00
           60        0.09       0.03                       1.00        0.00      0.00
           80        0.09       0.02                       1.00        0.00      0.00
isort      20        0.98       0.97                       0.03        0.97      0.00
           40        0.42       0.40                       0.06        0.88      0.06
           60        0.05       0.03                       0.52        0.24      0.24
           80        0.04       0.01                       1.00        0.00      0.00
ludcmp     20        0.32       0.30                       0.19        0.35      0.46
           40        0.14       0.11                       0.39        0.41      0.20
           60        0.14       0.10                       0.29        0.00      0.71
           80        0.07       0.02                       0.80        0.00      0.20
matmul     20        0.90       0.89                       0.02        0.97      0.02
           40        0.11       0.10                       0.11        0.82      0.07
           60        0.03       0.02                       0.33        0.02      0.66
           80        0.02       0.00                       0.94        0.00      0.06
qsort      20        0.60       0.57                       0.23        0.65      0.13
           40        0.51       0.44                       0.22        0.73      0.05
           60        0.35       0.26                       0.25        0.68      0.07
           80        0.21       0.08                       0.42        0.31      0.28
qurt       20        0.38       0.34                       0.50        0.50      0.00
           40        0.31       0.22                       0.57        0.41      0.02
           60        0.27       0.14                       0.61        0.25      0.15
           80        0.24       0.07                       0.61        0.23      0.16
select     20        0.91       0.88                       0.11        0.89      0.00
           40        0.45       0.40                       0.19        0.81      0.00
           60        0.17       0.09                       0.54        0.34      0.12
           80        0.13       0.03                       0.78        0.00      0.22
tstdemo    20        0.00       0.00                       0.42        0.48      0.10
           40        0.00       0.00                       0.68        0.18      0.14
           60        0.00       0.00                       0.72        0.07      0.21
           80        0.00       0.00                       0.70        0.00      0.30
Nearly all programs benefit from scheduling, with miss rates reduced by between 10 percent and 60 percent. Two numeric programs, however, DES and ludcmp, have increases of 3 percent and 5 percent, respectively. Both of these programs experience increased conflict misses at this segment size when code is scheduled according to profiled execution frequency.
Figure 2.4. Effect of code scheduling on miss rate, locality 60 (original MR and scheduled MR for each benchmark program). Compares miss rates before and after code scheduling with the locality 60 segment-sizing strategy.
2.3.5 Using Set-Associativity to Reduce Misses
Cache set-associativity is a hardware-based approach for reducing conflicts. In a set-associative cache, each address maps to one set that contains m entries (in an m-way cache); these m entries are accessed in a fully associative manner (i.e., the requested item can be found in any of the positions, and they must all be searched in parallel). So long as no more than m addresses conflict in each set, conflict misses can be avoided. Associativity is a feature orthogonal to segmentation; in a segmented cache, it doesn’t matter whether a set is a single item or a collection of items.
In our next set of experiments, we measure the effect of associativity on conflict misses both before and after code scheduling. The results are shown in Figure 2.5. In the figure, conflict misses for each cache organization are reported as normalized to the number of conflicts seen in a direct-mapped cache of the same capacity. We make the following observations:
✦ Set-associativity alone is ineffective at reducing conflict misses. The first part of Figure 2.5 shows that conflicts actually increase at least as often as they decrease. This counterintuitive result arises because, when associativity increases, so does the number of potentially conflicting addresses that map to a single set; if an increase in associativity brings into the set a more frequent conflicting address, misses can increase. Of course, this is likely to happen only in full caches (i.e., caches that utilize their full capacity), as these caches are by design.
✦ Set-associativity is a clear benefit when used with code scheduling: more than half the time, it halves the number of conflict misses.
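The first observation can be reproduced on a toy example. The sketch below assumes LRU replacement within each set and a fixed total capacity (so doubling the associativity halves the number of sets); the replacement policy, the function name, and the traces are assumptions made for illustration, not details taken from the experiments above.

```python
def count_misses_lru(trace, capacity_blocks, ways):
    """Count misses in an m-way set-associative cache with LRU replacement."""
    num_sets = capacity_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each list is ordered LRU -> MRU
    misses = 0
    for addr in trace:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.remove(addr)             # hit: refreshed below as MRU
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used
        lines.append(addr)
    return misses


CAPACITY = 8  # blocks, held constant across organizations

# Two blocks that conflict when direct-mapped fit together in a 2-way set.
print(count_misses_lru([0, 8] * 4, CAPACITY, ways=1))     # 8
print(count_misses_lru([0, 8] * 4, CAPACITY, ways=2))     # 2

# With fewer sets, block 4 joins blocks 0 and 8 in one set and misses rise.
print(count_misses_lru([0, 4, 8] * 3, CAPACITY, ways=1))  # 7
print(count_misses_lru([0, 4, 8] * 3, CAPACITY, ways=2))  # 9
```

In the second trace, block 4 has a set to itself when the cache is direct-mapped, but in the two-way organization it maps to the same set as blocks 0 and 8, so three frequently referenced blocks compete for two entries.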
Figure 2.5. Effect of associativity on conflict misses, with one panel for the original program layout (top) and one for the scheduled program layout (bottom); the vertical axis is conflicts normalized to direct-mapped. Reports the effect of set associativity on conflict misses both before (top) and after (bottom) code scheduling with the locality 60 segment-sizing strategy. Results are normalized to the number of conflict misses seen in a direct-mapped cache; 2-, 4-, and 8-way set-associativities are reported.
2.3.6 Segment Sharing
As discussed in Section 2.3.3, it is possible, and indeed common, for multiple thread contexts to execute the same program. In this case, it makes sense to assign each context to the same cache segment. While economizing on instruction cache space, this situation raises the possibility of interthread conflicts. In this section, we measure the effect of this type of sharing on miss rate. (Another possibility includes mapping different programs into the same segment for programs to share, but we leave this to future work.)
Since segments are shared, our simulation methodology must change somewhat. If T threads are sharing a segment, then each of those threads is likely to be at a different point of program execution. In our experiments, we begin simulation by assigning each thread a random start location in the instruction trace. Each thread then executes, one at a time in round-robin fashion, until a predetermined number of instructions have been fetched. In these experiments, we limit each thread to 1/T of the total number of instructions in the original trace (so that a segment-sharing profile run fetches as many instructions as a stand-alone profile run). In addition to a round-robin scheduling policy (which is the one implemented in the Intel IXP NPs), each thread executes a random number of instructions (sampled from a set of run lengths normally distributed around seven instructions) before swapping out.
Figure 2.6 reports miss rates when segments are shared between multiple threads. Results are shown for direct-mapped and four-way caches shared by two, four, and eight threads. We make the following observations:
✦ Miss rates can increase significantly, nearly doubling in the most extreme cases, FFT and ludcmp.
✦ Set-associativity is always an improvement, but sometimes only negligibly so (e.g., binsearch with 2 and 4 threads).
✦ In a few cases (e.g., DES, qsort, qurt), miss rate improves due to constructive interference.
Figure 2.6. Cache results for segment sharing, locality 60 (miss rate for each benchmark program; series: Sched MR; 1W,2T; 1W,4T; 1W,8T; 4W,2T; 4W,4T; 4W,8T). Reports miss rates for direct-mapped and four-way set associative segments when two, four, and eight threads share a given segment when code is scheduled with the locality 60 segment-sizing strategy. Among the legend entries, 1W indicates a direct-mapped segment, and 2T indicates two active threads.
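The shared-segment methodology described in this section can be sketched as a trace generator. The standard deviation of the run lengths, the wrap-around at the end of the trace, and all names below are assumptions made for illustration; the text specifies only random start locations, round-robin scheduling, run lengths normally distributed around seven instructions, and a budget of 1/T of the trace per thread.

```python
import random


def interleave_threads(trace, num_threads, mean_run=7, stddev=2, seed=0):
    """Build the fetch stream seen by a segment shared by several threads."""
    rng = random.Random(seed)
    budget = len(trace) // num_threads            # 1/T of the trace each
    positions = [rng.randrange(len(trace)) for _ in range(num_threads)]
    remaining = [budget] * num_threads
    merged = []
    thread = 0
    while any(remaining):
        if remaining[thread]:
            # Run length drawn from a normal distribution, at least one.
            run = max(1, round(rng.gauss(mean_run, stddev)))
            run = min(run, remaining[thread])
            for _ in range(run):
                merged.append(trace[positions[thread]])
                # Assumption: a thread wraps to the start of the trace.
                positions[thread] = (positions[thread] + 1) % len(trace)
            remaining[thread] -= run
        thread = (thread + 1) % num_threads       # round-robin swap
    return merged


trace = list(range(100))      # stand-in for a recorded instruction trace
shared = interleave_threads(trace, num_threads=4)
print(len(shared))            # 100: as many fetches as a stand-alone run
```

The resulting stream can then be fed to any cache model to measure the miss rate of the shared segment.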
Clearly, the decision to instantiate multiple threads to execute a given program makes sense only if it leads to improved performance. If profiling information were to suggest (as it does for several cases in Figure 2.6) that additional threads might decrease performance, then, indeed, fewer threads should be instantiated. The problem of segment sharing on a segmented cache is basically equivalent to generic cache sharing on multithreaded processors [11], an area of research where some code scheduling work has been done. However, it is unclear how much of that work applies to small, full caches such as these; we plan to explore this issue further in future work. For example, it may be that caching only frequently executed instructions (based on profile information) would greatly reduce interthread conflicts.
2.4 RELATED WORK
There is a considerable body of work in the independent areas of real-time cache analysis and multithreaded/multiprogrammed cache analysis and design. In the real-time literature, several groups have studied worst-case execution time (WCET) cache analysis [12–14]. Each of these techniques proposes a different analytical method for bounding the performance of a given program on a given processor and cache organization. The methods are descriptive and do not prescribe improvements in program structure should the WCET be too great.
Instruction delivery in multithreaded processors was first considered in the design of the first multithreaded computers [15] and multicomputers [16]. More recent studies evaluate workstation and server cache performance and requirements [17] on multithreaded architectures, as well as the previously mentioned work on compilation for instruction cache performance on multithreaded processors [11].
Column caching [18] is a dynamic cache partitioning technique in which individual associative ways are allocated for various purposes, such as to threads or programs for guaranteed performance. The drawback is that high associativities are needed for a high degree of allocation. In any case, segmentation, which does not require set-associativity, is orthogonal to column caching and could be used to increase the granularity of allocation. Other approaches [19, 20] have also considered allocating portions of a cache at varying granularities, from individual data objects to processes. The overall aim of these projects is the same as ours (to use distinct cache regions to provide reliable cache performance), and the code-scheduling techniques we describe here would also work for these other proposals.
2.5 CONCLUSIONS AND FUTURE WORK
In this chapter, we proposed the use of segmented instruction caches along with profile-driven code scheduling to provide flexible instruction delivery to real-time and nonreal-time threads running on the same multithreaded processor. This technique is particularly useful in NP data processors, which are multithreaded yet inhibited by a fixed-size control store. The segmented instruction cache allows real-time programs to map into a private segment large enough to avoid misses while allowing nonreal-time programs to suffer misses while keeping all cache conflicts limited to within individual segments. This removes program size restrictions on nonreal-time code without sacrificing guaranteed instruction delivery to real-time programs. Several program-specific segment-sizing strategies were evaluated, and code scheduling was seen to be an effective method for removing a majority of conflict misses, often reducing miss rates on a selection of programs by 10 percent to 60 percent.
We plan to consider a number of additional topics in future work: avoiding index calculations and tag checks (completely for real-time segments, and via speculation otherwise), mechanisms to shape instruction fetch bandwidth, improving shared-segment miss rates, measuring sensitivity to cache parameters (e.g., block size), investigating the benefits of dynamically changing segment sizes at run-time (both for these statically composed threads and for dynamically composed scenarios), providing sharing between segments (e.g., for shared code libraries), and exploring the use of segmentation in data caches.
REFERENCES
[1] Intel Corp., Intel IXP Family of Network Processors, developer.intel.com, 2001.
[2] V. Kumar, T. Lakshman, and D. Stiliadis, “Beyond best effort: Router architectures for the differentiated services of tomorrow’s Internet,” IEEE Communications Magazine, pp. 152–164, May 1998.
[3] M. D. Hill, Aspects of cache memory and instruction buffer performance, Ph.D. Dissertation, Tech. Report UCB/CSD 87/381, Computer Sciences Division, UC-Berkeley, November 1987.
[4] J. Bentley and R. Sedgewick, “Fast algorithms for sorting and searching strings,” SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1997.
[5] C-LAB, Siemens AG, and Universität Paderborn, C-LAB WCET Benchmarks, www.c-lab.de, 2003.
[6] M. Hill and J. Elder, DineroIV Trace Driven Uniprocessor Cache Simulator, www.cs.wisc.edu/markhill/DineroIV, 2003.
[7] Python, The Python programming language, www.python.org, 2002.
[8] C. Y. Park, Predicting Deterministic Execution Times of Real-Time Programs, Ph.D. thesis, University of Washington, August 1992.
[9] National Bureau of Standards, Data Encryption Standards, FIPS Publication 46, U.S. Dept. of Commerce, 1977.
[10] K. Pettis and R. C. Hansen, “Profile guided code positioning,” Proceedings of the ACM SIGPLAN ’90 Conference on Programming Language Design and Implementation (SIGPLAN ’90), pp. 16–27, June 1990.
[11] R. Kumar and D. M. Tullsen, “Compiling for instruction cache performance on a multithreaded architecture,” Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 419–429, IEEE Computer Society Press, 2002.
[12] J. Engblom and A. Ermedahl, “Modeling complex flows for worst-case execution time analysis,” Proceedings of 21st IEEE Real-Time Systems Symposium (RTSS ’00), 2000.
[13] Y.-T. S. Li, S. Malik, and A. Wolfe, “Cache modeling for real-time software: Beyond direct mapped instruction caches,” Proceedings of the IEEE Real-Time Systems Symposium, 1996.
[14] G. Ottosson and M. Sjödin, “Worst-case execution time analysis for modern hardware architectures,” ACM SIGPLAN 1997 Workshop on Languages, Compilers, and Tools for Real-Time Systems (LCT-RTS ’97), 1997.
[15] J. Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman and Co., 1970.
[16] B. Smith, “Architecture and applications of the HEP multiprocessor computer system,” Fourth Symposium on Real Time Signal Processing, pp. 241–248, August 1981.
[17] Y. Chen, M. Winslett, S. Kuo, Y. Cho, M. Subramaniam, and K. E. Seamons, “Performance modeling for the Panda array I/O library,” Proceedings of Supercomputing ’96, ACM Press and IEEE Computer Society Press, 1996.
[18] D. Chiou, P. Jain, S. Devadas, and L. Rudolph, “Dynamic cache partitioning via columnization,” Proceedings of Design Automation Conference, Los Angeles, June 2000.
[19] D. B. Kirk, J. K. Strosnider, and J. E. Sasinowski, “Allocating SMART cache segments for schedulability,” Proceedings of the Euromicro ’91 Workshop on Real Time Systems, pp. 41–50, Paris-Orsay, France, 1991.
[20] D. May, J. Irwin, H. L. Muller, and D. Page, “Effective caching for multithreaded processors,” in P. H. Welch and A. W. P. Bakkers, editors, Communicating Process Architectures 2000, pp. 145–154, IOS Press, September 2000.
CHAPTER 3
Efficient Packet Classification with Digest Caches
Francis Chang, Wu-chang Feng, Wu-chi Feng
Department of Computer Science, Portland State University
Kang Li
Department of Computer Science, University of Georgia
As the number of hosts and the volume of network traffic continue to grow, the need to efficiently handle packets at line speed becomes increasingly important. Packet classification is one technique that allows in-network devices such as firewalls, network address translators, and edge routers to provide differentiated service and access to network and host resources by efficiently determining how the packet should be processed. These services require a packet to be classified so that a set of rules can be applied to such network header information as the destination address, flow identifier, port number, or layer-4 protocol type. The development of more efficient classification algorithms has been the focus of many research papers [1–6]. However, the hardware requirements of performing a full classification on each packet at current line rates can be overwhelming [7]. Moreover, there does not appear to be a good algorithmic solution for multiple field classifiers containing more than two fields [8].
A classic approach to managing data streams that exhibit temporal locality is to employ a cache that stores recently referenced items. Packet classification is no different [9]. Such caches have been shown to increase the performance of route lookups significantly [10, 11]. How well a cache design performs is typically measured by its hit rate for a given cache size. Generally, as additional capacity is added to the cache, the hit rates and performance of the packet classification engine should increase.
Unlike route caches that need to store only destination address information, packet classification caches require the storage of full packet headers. Unfortunately, due to the increasing size of packet headers (e.g., with the eventual deployment of IPv6 [12]), storing full header information can be prohibitively expensive given the high-speed memory that would be required to implement such a cache.
Recently, we proposed a third axis for designing packet classification algorithms: accuracy [13]. That is, given a certain amount of error allowed in packet classification, can packet classification speeds be significantly increased? In a previous paper, we proposed the use of a modified Bloom filter [14] for packet classification. In that approach, classified packets satisfying a binary predicate are inserted into the filter that caches the decision. For instance, a network bridge would add flows that it has identified that it should forward to the Bloom filter. Subsequent packets then query the filter to quickly test membership before being processed further. Packets that hit in the filter are processed immediately, based on the predicate, while packets that miss go through the slower full packet classification lookup process.
There are three primary limitations of this Bloom filter cache design. First, each Bloom filter lookup requires N independent memory accesses, where N is the number of hash levels of the Bloom filter. For a Bloom filter optimized for a one in a billion packet misclassification probability, N = 30. Second, no mechanism exists to recover the current elements in a Bloom filter, preventing it from using efficient cache replacement mechanisms such as LRU. Finally, a Bloom cache is effective only for storing fewer than 256 binary predicates. Thus, it is not an appropriate data structure for attaching an arbitrary amount of data, due to the increasing number of Bloom filters required to support the data.
In this work, we propose the notion of digest caches for efficient packet classification. The goal of digest caches is similar to Bloom-filter caches in that they trade some accuracy in packet classification in exchange for increased performance. Digest caches, however, allow traditional cache management policies such as LRU to be employed to better manage the cache over time. Instead of storing a Bloom filter signature of a flow identifier (source and destination IP addresses and ports and protocol type), it is necessary only to store a hash of the flow identifier, allowing for smaller cache entries. We will also discuss how to extend this idea to accelerate exact caching strategies by building multilevel caches with digest caches.
Section 3.1 covers related work while Section 3.2 outlines the design of our architecture. Section 3.3 evaluates the performance of our design using sample network traces, while Section 3.4 discusses the performance overhead incurred by our algorithm as measured on the IXP1200 network processor platform.
3.1 RELATED WORK
Due to the high processing costs of packet classification, network appliance designers have resorted to using caches to speed up packet processing time. Early work in network cache design borrowed concepts from computer architecture (LRU stacks, set-associative multilevel caches) [10]. Some caching strategies rely on CPU L1 and L2 caches [7] while others attempt to map the IP address space to memory address space in order to take advantage of the hardware TLB [15]. Another approach is to add an explicit timeout to an LRU set-associative cache to improve performance by reducing thrashing [11]. More recently, in addition to leveraging the temporal locality of packets observed on networks, approaches to improving cache performance have applied techniques to compress and cache IP ranges to take advantage of the spatial locality in the address space of flow identifiers as well [16, 17]. This effectively allows multiple flows to be cached in a single cache entry, so that the entire cache may be placed into small high-speed memory such as a processor’s L1/L2 cache.
There has been work using Bloom filters to accelerate exact prefix-matching schemes [18]. Much of this work is not applicable to higher-level flow identification, which is the motivation for our work. Additionally, all of these bodies of work are fundamentally different from the material presented in this chapter, because they consider only exact caching strategies. Our approach attempts to maximize performance given constrained resources and an allowable error rate.
3.2 OUR APPROACH
Network cache designs typically employ simple set-associative hash tables, ideas that are borrowed from their traditional memory management counterparts. The goal of the hash tables is to quickly determine the operation or forwarding interface that should be used, given the flow identifier. Hashing the flow identifier allows traditional network processors to determine what operation or forwarding interface should be used while examining only a couple of entries in the cache.
We believe one limitation of exact matching caches for flow identifiers is the need to store quite large flow identifiers (e.g., 37 bytes for an IPv6 flow identifier) with each cache entry. This limits the amount of information one can cache or increases the time necessary to find information in the cache.
In this chapter, we propose the notion of digest caches. The most important property of a digest cache is that it stores only a hash of the flow identifier instead of the entire flow identifier. The goal of the digest is to significantly reduce the amount of information stored in the cache, in exchange for a small amount of error in cache lookups. As will be described later in this section, digest caches can be used in two ways. First, they can be used as the only cache for the packet classifier, allowing the packet classifier caches to be small. Second, they can be used as an initial lookup in an exact classification scenario. This allows a system to quickly partition the incoming packets into those that are in the exact cache and those that are not.
In the rest of this section, we will motivate approximate algorithms for packet classification caches. We will then focus on properties of the digest cache, comparing it to previously proposed Bloom-filter-based packet classifiers, and using it to speed up exact packet classifiers. Digest caches are superior to Bloom caches in two ways. Cache lookups can be performed in a single memory access, and they allow direct addressing of elements, which can be used to implement efficient cache eviction algorithms, such as LRU.
3.2.1 The Case for an Approximate Algorithm
For the purposes of this study, we use a misclassification probability of one in a billion. Typically, TCP checksums will fail for approximately 1 in 1100 to 1 in 32,000 packets, even though link-level CRCs should admit error rates of only about 1 in 4 billion. On average, between 1 in 16 million and 1 in 10 billion TCP packets will contain an undetectable error [19]. We contend that a misclassification probability of this magnitude will not meaningfully degrade network reliability. It is the responsibility of the end system to detect and compensate for errors that may occur in the network [20]. Errors in the network are typically self-healing in the sense that misdirected flows will be evicted from the cache as they age. Moreover, the network already guards against misconfigurations and mistakes made by the hardware. For example, the IP TTL fields are used to protect against routing loops in the network.
Another argument underscoring the unreliability of the network is that TCP flows that are in retransmission timeout (RTO) mode are of no use. Consider a web browser. Flows that are stalled in RTO mode often result in the user reestablishing a web connection. In the case that a reload is necessary, a new ephemeral port will be chosen by the client, and thus a new flow identifier is constructed. If an approximate cache has misclassified a previous flow, it will have no impact on the classification of the new flow.
In some cases, such as firewalls, it is undesirable for the cache systems to have errors. To “harden” approximate caching hardware against misclassifications, layer-4 hints, such as TCP SYN flags, can be used to force a full packet classification pass to ensure that new flows are not misclassified.
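The SYN-based hardening mentioned above might look like the following sketch. It models the cache as a plain dictionary keyed by flow digest and takes the full classifier as a callable; both are placeholders chosen for illustration and do not reflect the hardware design discussed in this chapter.

```python
TCP_SYN = 0x02  # SYN bit in the TCP flags field


def classify(digest, tcp_flags, cache, full_classify):
    """Return the cached decision unless the packet opens a new TCP flow.

    cache is a dict keyed by flow digest and full_classify is a callable
    performing the slow, exact classification; both are hypothetical.
    """
    if tcp_flags & TCP_SYN or digest not in cache:
        # New flow (or cache miss): classify exactly and refresh the entry.
        cache[digest] = full_classify()
    return cache[digest]


# A stale entry left by an earlier flow whose digest happens to collide.
cache = {0xBEEF: "forward"}
first = classify(0xBEEF, TCP_SYN, cache, lambda: "drop")
later = classify(0xBEEF, 0x10, cache, lambda: "forward")  # ACK-only packet
print(first, later)  # drop drop
```

The SYN packet bypasses the stale entry and overwrites it, so later packets of the new flow hit a decision that was produced by the full classifier.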
3.2.2 Dimensioning a Digest Cache
The idea of our work is simply the direct comparison of hashed flow identifiers to match cached flows. In this sense, we will trade the accuracy of a cache for a reduced storage requirement. We will partition memory into a traditional, set-associative cache. When constructing our digest cache, we first need to decide how to allocate memory. Previous work has demonstrated that higher cache associativity yields better cache hit rates [10, 21]. However, in the case of the digest cache, an increase in the degree of associativity must be accompanied by an increase in the size of the flow identifier’s hash, to compensate for the additional probability of collision. If the digest is a c-bit hash, and we have a d-way set-associative cache, then the probability of cache misidentification is
\[ p \approx \frac{d}{2^{c}} \qquad (3.1) \]
The equation can be described as follows: Each cache line has d entries, each entry of which can take 2^c values. A misclassification occurs whenever a new entry has coincidentally the same hash value as any of the existing d entries. We must employ a stronger hash to compensate for increasing collision opportunities (associativity). Figure 3.1 graphs the number of flows that a four-way set-associative cache can store, assuming different misclassification probability tolerances. The maximum number of addressable flows increases linearly with the amount of memory and decreases logarithmically with the packet misclassification rate.
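As a worked instance of Equation 3.1, the sketch below evaluates the misidentification probability for a given digest width and solves the same relation for the smallest whole number of digest bits that meets a target probability. The function names are illustrative only.

```python
import math


def misclassification_probability(d, c):
    """Equation 3.1: p is approximately d / 2**c for a d-way cache."""
    return d / 2 ** c


def digest_bits(d, p):
    """Smallest whole number of digest bits c with d / 2**c <= p."""
    return math.ceil(math.log2(d / p))


print(digest_bits(4, 1e-9))                   # 32
print(misclassification_probability(4, 32))   # slightly below 1e-9
```

For a four-way cache and a one-in-a-billion target, the relation calls for just under 32 bits, which rounds up to a 32-bit digest.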
Figure 3.1. Maximum number of flows that can be addressed in a four-way set associative digest cache, with different misclassification probabilities, p (curves for p = 1e-4 through p = 1e-10). Horizontal axis: amount of memory, M (in KB); vertical axis: maximum number of flows, k.
3.2.3 Theoretical Comparison
To achieve a misclassification probability of one in a billion, a Bloom filter cache must use 30 independent hash functions to optimally use memory. This allows us to store a maximum of k flows in our cache [13],
\[ k_{\mathrm{Bloom\,cache}} = \frac{\ln\left(1 - p^{1/L}\right)}{\ln\left(1 - L/M\right)} \qquad (3.2) \]
where L = 30 is the number of hash functions, M is the amount of memory in bits, and p is the misidentification probability. For direct comparison, the maximum number of flows that a digest cache can store, independent of the associativity, is given by
\[ k_{\mathrm{digest}} = \frac{M}{c} \qquad (3.3) \]
where the required number of bits in the digest function is given by
\[ c = \log_2(d/p) \qquad (3.4) \]
This relation is dependent on p, the misidentification probability, and d, the desired level of cache set-associativity. The derivation of this formula follows from Equation 3.1.
Figure 3.2 compares the storage capacity of both caching schemes. Both schemes linearly relate storage capacity to available memory, but it is interesting to note that simply storing a hash is more than 35 percent more efficient in terms of memory use than a Bloom filter, for this application. One property that makes a Bloom filter a useful algorithm is its ability to insert an unlimited number of signatures into the data structure, at the cost of increased misidentification. However, since we prefer a bounded misclassification rate, this property is of no use in solving our problem.
Figure 3.2. Comparison of storage capacity of various caching schemes. The Bloom filter cache assumes a misidentification probability of one in a billion, which under optimal conditions is modeled by a Bloom filter with 30 hash functions. Horizontal axis: amount of memory, M (in KB); vertical axis: maximum number of flows, k.
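Equations 3.2 through 3.4 can be evaluated numerically to compare the two schemes. The sketch below is illustrative: the 8 KB memory size and the four-way associativity are example parameters chosen here, and the check on the number of hash functions uses the standard result that an optimally dimensioned Bloom filter needs about log2(1/p) hash functions.

```python
import math


def bloom_capacity(memory_bits, p, num_hashes):
    """Equation 3.2: maximum flows in a Bloom filter cache."""
    return (math.log(1 - p ** (1 / num_hashes))
            / math.log(1 - num_hashes / memory_bits))


def digest_capacity(memory_bits, p, ways):
    """Equations 3.3 and 3.4: maximum flows in a digest cache."""
    c = math.log2(ways / p)   # required digest bits (not rounded)
    return memory_bits / c


M = 8 * 1024 * 8              # 8 KB expressed in bits (example size)
p = 1e-9                      # target misidentification probability

L = math.ceil(math.log2(1 / p))   # hash functions for an optimal filter
k_bloom = bloom_capacity(M, p, L)
k_digest = digest_capacity(M, p, ways=4)

print(L)
print(round(k_bloom), round(k_digest), round(k_digest / k_bloom, 2))
```

Because both capacities grow essentially linearly in M, the ratio between them changes very little with the amount of memory, and it should come out near the 35 percent advantage quoted above.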
3.2.4 A Specific Example of a Digest Cache
To illustrate the operation of a digest cache, we will construct an example. Suppose we have a router with 16 interfaces and a set of classification rules, R. We begin by assuming that we have 64 KB of memory to devote to the cache and wish to have a four-way associative cache that has a misclassification probability of one in a billion. These parameters can be fulfilled by a 32-bit digest function, with 4 bits used to store per-flow routing information. Each cache entry is then 36 bits, making each cache line 144 bits (18 bytes). Partitioning 64 KB of cache memory into 18-byte cache lines gives a total of 3640 cache lines, which allows our cache to store 14,560 distinct entries. A visual depiction of this cache is given in Figure 3.3.
Figure 3.3. An overview of a 64 KB four-way set-associative digest cache, with a misclassification probability of one in a billion. This cache services a router with 16 interfaces. Cache line 0 holds entries 0 through 3, cache line 1 holds entries 4 through 7, and so on up to cache line 3639, which holds entries 14556 through 14559. Each cache entry consists of a 32-bit digest and a 4-bit route.
Now, let us consider a sample trace of the cache, which is initially empty. Suppose there are two distinct flows, A and B.

1. Packet 1 arrives from flow A.
   a. The flow identifier of A is hashed to H1(A) to determine the cache line to look up. That is, H1 is a map from flow identifier to cache line.
   b. A is hashed again to H2(A), and compared to all four elements of the cache line. There is no match. The result H2(A) is the digest of the flow identifier that is stored.
   c. A is classified by a standard flow classifier, and is found to route to interface 3.
   d. The signature H2(A) is placed in cache line H1(A), along with its routing information (interface 3).
   e. The packet is forwarded through interface 3.
2. Packet 2 arrives from flow A.
   a. The flow identifier of A is hashed to H1(A) to determine the cache line to look up.
   b. A is hashed again to H2(A), and compared to all four elements of the cache line. There is a match, and the packet is forwarded to interface 3.
3. Packet 3 arrives from flow B.
   a. The flow identifier of B is hashed to H1(B) to determine the cache line to look up. Coincidentally, H1(A) = H1(B).
   b. B is hashed again to H2(B), and compared to all four elements of the cache line. Coincidentally, H2(A) = H2(B). There is a match, and the packet is forwarded to interface 3.

The probability of this sort of misclassification is $4/2^{32} \approx 10^{-9}$.
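The trace above can be played back with a small software model. This is a minimal sketch under stated assumptions rather than the chapter's implementation: the class and function names are hypothetical, H1 and H2 are taken from disjoint parts of a SHA-1 output, and replacement within a line is LRU.

```python
import hashlib

class DigestCache:
    """Set-associative cache that stores only a digest of each flow ID."""

    def __init__(self, num_lines, ways=4, digest_bits=32):
        self.num_lines = num_lines
        self.ways = ways
        self.digest_mask = (1 << digest_bits) - 1
        # Each line is a list of (digest, route), most recently used last.
        self.lines = [[] for _ in range(num_lines)]

    def _hashes(self, flow_id: bytes):
        h = hashlib.sha1(flow_id).digest()
        # H1 selects the cache line; H2 is the stored digest. They are
        # derived from disjoint bytes of the SHA-1 output (an assumption).
        h1 = int.from_bytes(h[:8], "big") % self.num_lines
        h2 = int.from_bytes(h[8:16], "big") & self.digest_mask
        return h1, h2

    def lookup(self, flow_id: bytes):
        line_index, digest = self._hashes(flow_id)
        line = self.lines[line_index]
        for i, (stored, _route) in enumerate(line):
            if stored == digest:
                line.append(line.pop(i))  # refresh LRU position
                return line[-1][1]
        return None

    def insert(self, flow_id: bytes, route: int):
        line_index, digest = self._hashes(flow_id)
        line = self.lines[line_index]
        if len(line) >= self.ways:
            line.pop(0)  # evict the least recently used entry
        line.append((digest, route))


def forward(cache, flow_id, classify):
    """Return (route, 'hit' or 'miss'), classifying only on a miss."""
    route = cache.lookup(flow_id)
    if route is not None:
        return route, "hit"
    route = classify(flow_id)
    cache.insert(flow_id, route)
    return route, "miss"


cache = DigestCache(num_lines=3640)
print(forward(cache, b"flow A", lambda f: 3))  # packet 1: miss, classified
print(forward(cache, b"flow A", lambda f: 3))  # packet 2: hit
```

The coincidence in step 3 is too rare to observe with 32-bit digests, but it can be forced for illustration by shrinking the model to one line and a zero-bit digest, in which case every flow matches the first stored entry and is forwarded to that entry's interface.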
In the absence of misclassifications, this scheme behaves exactly as a four-way set-associative cache with 14,560 entries (3640 cache lines). Using an equivalent amount of memory (64 KB), a cache storing IPv4 flow identifiers will be able to store 4852 entries, and a cache storing IPv6 flow identifiers will be able to store 1744 entries. The benefit of using a digest cache is twofold. First, it increases the effective storage capacity of cache memory, allowing the use of smaller, faster memory. Second, it reduces the memory bandwidth required to support a cache by reducing the amount of data required to match a single packet. As intuition and previous studies would indicate, a larger cache will improve cache performance [10, 21, 22]. To that end, in this example, deploying a digest cache increases the effective cache size by a factor of roughly three (relative to the IPv4 cache) to eight (relative to the IPv6 cache).
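The entry counts quoted above can be checked with the same line-packing arithmetic. This sketch is illustrative only: it assumes 104-bit IPv4 and 296-bit IPv6 flow identifiers (addresses, ports, and protocol), a 4-bit route stored with every entry, and cache lines rounded up to whole bytes. The function name is hypothetical.

```python
def set_associative_capacity(memory_bytes, key_bits, route_bits=4, ways=4):
    """Entries that fit when whole cache lines are packed into memory."""
    line_bits = ways * (key_bits + route_bits)
    line_bytes = -(-line_bits // 8)  # ceiling division to whole bytes
    return (memory_bytes // line_bytes) * ways

MEMORY = 64 * 1024
capacities = {
    "32-bit digest": set_associative_capacity(MEMORY, 32),
    "exact IPv4 (104-bit identifier)": set_associative_capacity(MEMORY, 104),
    "exact IPv6 (296-bit identifier)": set_associative_capacity(MEMORY, 296),
}
for name, entries in capacities.items():
    print(f"{name}: {entries} entries")
```

Under these assumptions the three results are 14,560, 4852, and 1744 entries, matching the figures in the text.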
3.2.5 Exact Classification with Digest Caches

Digest caches can also be used to accelerate exact caching systems, by employing a multilevel cache (see Figure 3.4). A digest cache is constructed, in conjunction with an exact cache that shares the same dimensions. While the digest cache stores only a hash of flow identifiers, the exact cache stores the full flow identifier. Thus, the two hierarchies can be thought of as “mirrors” of each other.

[Figure 3.4: A multilevel digest-accelerated exact cache. The digest cache allows you to filter potential hits quickly, using a small amount of faster memory. The diagram shows a cache lookup passing first through the digest cache and then through the exact cache.]

A c-bit, d-way set-associative digest cache implemented in a sequential memory access model will be able to reduce the amount of exact cache memory accessed (due to cache misses) by a factor of

$$p_{\text{miss\_savings}} = \frac{1}{2^{c}} \tag{3.5}$$

while the amount of exact cache memory accessed by a cache hit is reduced by a factor of

$$p_{\text{hit\_savings}} = \frac{1}{d} + \frac{1}{2^{c}} \times \frac{d-1}{d} \tag{3.6}$$
The intuition behind Equation 3.6 is that each cache hit must access the exact flow identifier, while each of the other associative cache entries has an access probability of $2^{-c}$. Note that the digest cache allows for multiple entries in a cache line to share the same value because the exact cache can resolve collisions of this type. Since this application relies on hashing strength only for performance and not for correctness, it is not necessary to have as strong a misclassification rate. A multilevel 8-bit four-way set-associative digest-accelerated cache will incur a 4-byte first-level lookup overhead. However, it will reduce the second-level memory access cost of an IPv6 cache hit lookup from 148 bytes to 37.4 bytes, and that of a cache miss lookup from 148 bytes to 0.6 bytes. Assuming a 95 percent hit rate, the average cost of cache lookups is reduced to 4 bytes of first-level cache and 35.6 bytes of second-level cache.
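The byte counts in this paragraph follow directly from Equations 3.5 and 3.6. The short sketch below works through the arithmetic for the 8-bit, four-way IPv6 case; the function and variable names are hypothetical, and the 148-byte line is taken to be four 37-byte IPv6 flow identifiers.

```python
def miss_savings(c):
    """Equation 3.5: fraction of exact-cache memory read on a miss."""
    return 1 / 2**c

def hit_savings(c, d):
    """Equation 3.6: fraction of exact-cache memory read on a hit."""
    return 1 / d + (1 / 2**c) * (d - 1) / d

LINE_BYTES = 148            # four 37-byte IPv6 flow identifiers
c, d, hit_rate = 8, 4, 0.95

hit_bytes = LINE_BYTES * hit_savings(c, d)
miss_bytes = LINE_BYTES * miss_savings(c)
average = hit_rate * hit_bytes + (1 - hit_rate) * miss_bytes
print(f"{hit_bytes:.1f} {miss_bytes:.1f} {average:.1f}")
# 37.4 0.6 35.6
```

A hit therefore reads about 37.4 bytes of second-level memory, a miss about 0.6 bytes, and the 95 percent weighted average is about 35.6 bytes.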
3.3 EVALUATION

For evaluation purposes, we used two datasets, each one hour in length. The first of the datasets was collected by Bell Labs Research, Murray Hill, NJ, at the end of May 2002. This dataset was made available through a joint project between NLANR PMA and the Internet Traffic Research Group [23]. The trace was of a 9 Mb/s Internet link, serving a staff of 400 people. The second trace was a nonanonymized trace collected at our university's OC-3c link. Our link connects with Internet2 in partnership with the Portland Research and Education Network (PREN). This trace was collected on the afternoon of July 26, 2002. Table 3.1 presents a summary of the statistics of these two datasets. A graph of the number of concurrent flows is shown in Figure 3.5. For the purposes of our graph, a flow is defined to be active between the time of its first and last packet, with a 60-second maximum interpacket spacing. This number is chosen in accordance with other measurement studies [24, 25]. A reference “perfect cache” was simulated. We define a perfect cache to be a fully associative cache with an infinite amount of memory. Thus, a perfect cache takes only compulsory cache misses. The results are presented in Table 3.2. The OGI trace captured a portion of an active Half-Life game server, whose activity is characterized by a moderate number (∼20) of long-lived UDP flows.

Table 3.1: Summary statistics for the sample traces.

                                          Bell Trace     OGI Trace
Trace Length (seconds)                    3,600          3,600
Number of Packets                         974,613        15,607,297
Avg. Packet Rate (Packets per Second)     270.7          4,335.4
TCP Packets                               303,142        5,034,332
UDP Packets                               671,471        10,572,965
Number of Flows                           32,507         160,087
Number of TCP Flows                       30,337         82,673
Number of UDP Flows                       2,170          77,414
Avg. Flow Length (seconds)                3.27           10.21
Longest Flow (seconds)                    3,599.95       3,600
Avg. Packets/Flow                         29.98          97.49
Avg. Packets/TCP Flow                     9.99           60.89
Avg. Packets/UDP Flow                     309.43         136.58
Max # of Concurrent Flows                 268            567

[Figure 3.5: Number of concurrent flows in test data sets. The plot shows the number of flows (0 to 600) against time (0 to 3500 seconds) for the OGI trace and the Bell trace.]

Table 3.2: The results of simulating a perfect cache.

                                              Bell Trace     OGI Trace
Hit Rate                                      0.971          0.988
Intrinsic Miss Rate                           0.029          0.012
Maximum misses (over 100 ms intervals)        6              189
Variance of misses (over 100 ms intervals)    1.3540         17.438
Average misses (over 100 ms intervals)        0.775          5.843
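The two definitions used in this section, the perfect cache and the 60-second flow activity rule, are simple enough to state as code. The following is a minimal sketch with hypothetical function names; it assumes packets are presented in time order as (timestamp, flow identifier) pairs and is not the simulator used for the reported results.

```python
def perfect_cache_hit_rate(flow_ids):
    """Hit rate of an infinite, fully associative cache.

    Only the first packet of each flow identifier misses (a compulsory miss).
    """
    seen = set()
    hits = total = 0
    for flow_id in flow_ids:
        total += 1
        if flow_id in seen:
            hits += 1
        else:
            seen.add(flow_id)
    return hits / total if total else 0.0


def flow_intervals(packets, max_gap=60.0):
    """Split time-ordered (timestamp, flow_id) packets into active flows.

    A flow is active from its first to its last packet; a gap longer than
    max_gap seconds between packets starts a new flow.
    """
    open_flows = {}   # flow_id -> [first_seen, last_seen]
    finished = []
    for timestamp, flow_id in packets:
        span = open_flows.get(flow_id)
        if span is not None and timestamp - span[1] <= max_gap:
            span[1] = timestamp
        else:
            if span is not None:
                finished.append(tuple(span))
            open_flows[flow_id] = [timestamp, timestamp]
    finished.extend(tuple(span) for span in open_flows.values())
    return finished


packets = [(0, "A"), (1, "B"), (10, "A"), (100, "A")]
print(perfect_cache_hit_rate(fid for _, fid in packets))  # 0.5
print(sorted(flow_intervals(packets)))  # [(0, 10), (1, 1), (100, 100)]
```

Counting how many of the returned intervals overlap a given instant gives the concurrent-flow curve of the kind plotted in Figure 3.5.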
3.3.1 Reference Cache Implementations

A Bloom filter cache [13] was simulated, using optimal dimensioning. Both cold caching and double-buffered aging strategies were run on the benchmark datasets. Optimal dimensioning for a misclassification probability of one in a billion requires 30 independent hash functions, meaning that each cache lookup and insertion operation requires 30 independent one-bit memory accesses. The digest cache presented in this chapter was chosen to be a four-way set-associative hash table, using 32-bit flow identifier digests. Each lookup and insertion operation requires a single 16-byte memory request. An LRU cache replacement algorithm was chosen, due to its low complexity and near-optimal behavior [10]. A four-way set-associative cache was chosen, because it performs almost as well as a fully associative cache [21]. Figure 3.6 graphs the behavior of digest caches with different set-associativities. We also compare our cache against traditional four-way set-associative layer-four IPv4- and IPv6-based hash tables. Each lookup and insertion operation requires a single 52-byte or 148-byte memory request, respectively. Hashing for all results presented here was accomplished with a SHA-1 [26] hash. It is important to note that the cryptographic strength of the SHA-1 hash is not an important property of an effective hashing function in this domain. It is sufficient that it is a member of the class of universal hash functions [27].
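The per-lookup memory costs quoted above can be derived from the cache parameters. The sketch below uses the textbook Bloom filter optimum (k = log2(1/p) hash functions and k/ln 2 bits per stored item), which is an assumption here rather than a quotation of the chapter's own derivation, together with assumed 13-byte IPv4 and 37-byte IPv6 flow identifiers. All names are hypothetical.

```python
import math

def bloom_optimal(p):
    """Textbook optimum for a Bloom filter with false-positive rate p."""
    k = math.ceil(math.log2(1 / p))   # number of hash functions
    bits_per_item = k / math.log(2)   # m / n when k = (m / n) * ln 2
    return k, bits_per_item

WAYS = 4
KEY_BYTES = {"32-bit digest": 4, "exact IPv4": 13, "exact IPv6": 37}

k, bits = bloom_optimal(1e-9)
print(f"Bloom filter: {k} one-bit accesses, about {bits:.1f} bits per flow")
for name, size in KEY_BYTES.items():
    # One request fetches a whole cache line of WAYS keys.
    print(f"{name}: one {WAYS * size}-byte request")
```

Under these assumptions the sketch gives 30 hash functions for the Bloom filter and line requests of 16, 52, and 148 bytes for the digest, IPv4, and IPv6 caches, in line with the text.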
[Figure 3.6: Hit rates for digest caches, as a function of memory for various set associativity, assuming a misclassification rate of one in a billion. Two panels (Bell trace and OGI trace) plot hit rate (%) against amount of cache memory (1000 to 100,000 bytes, logarithmic axis) for a 30-bit digest cache (1-way associative), 31-bit (2-way), 32-bit (4-way), 33-bit (8-way), 34-bit (16-way), and 35-bit (32-way).]
3.3.2 Results

In evaluating the performance of the caching systems, we must consider two criteria: the overall hit rate and the smoothness of the cache miss rate. A cache that has large bursts of cache misses is of no use, because it places strain on the packet classification engine. Figure 3.7 graphs the resulting hit rate of various caching strategies, using the sample traces. As expected, the digest cache scores hit rates equivalent to an IPv6-based cache ten times its size. More importantly, the digest cache still manages to outperform a Bloom filter cache. The digest cache yields an equivalent hit rate of a cold-caching Bloom filter 50–80 percent its size, and outperforms a double-buffered Bloom filter cache two to three times its size. Figure 3.8 graphs the variance of cache miss rates of the different caching approaches, aggregated over 100 ms intervals. As can be observed from the two traces, a digest cache gives superior performance, minimizing the variance in aggregate cache misses. It is interesting to note that for extremely small cache sizes, the digest cache exhibits a greater variance in hit rate than almost all other schemes. This can be attributed to the fact that the other algorithms in this interval perform uniformly poorly by comparison. As the cache size increases, the hit-rate performance improves, and the variance of cache miss rates decreases to a very small number. This is an important observation because it implies that cache misses, in these traces, are not dominated by bursty access patterns. To consider a more specific example, we have constructed a 2600-byte four-way set-associative digest cache. This size was chosen to coincide with the amount of local memory available to a single IXP2000 family microengine. Figure 3.9 presents a trace of the resulting cache miss rate, aggregated over one-second intervals.
This graph represents the number of packets a packet classification engine must process within one second to keep pace with the traffic load. As can be observed from the plot, a packet classification engine must be able to classify roughly 60 packets per second (pps) in the worst case for the Bell trace, and 260 pps in the worst case for the OGI trace. The average packet load over the entire trace is 270.7 and 4335.4 pps for the Bell and OGI traces, respectively. The peak packet rate for the Bell trace approached 1400 pps, while the peak rate for the OGI trace exceeded 8000 pps. By employing a 2600-byte digest cache, the peak stress level on the packet classification engine has been reduced by a factor of between 20 and 30 for the observed traces.
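The interval statistics discussed in this section (maximum, average, and variance of misses per interval) can be computed from a list of miss timestamps. This is a minimal sketch under stated assumptions: timestamps are integer milliseconds smaller than the trace duration, the function names are hypothetical, and population variance is used because the text does not say which estimator was applied.

```python
from statistics import mean, pvariance

def misses_per_interval(miss_times_ms, duration_ms, interval_ms=100):
    """Count cache misses falling into each fixed-length interval."""
    num_intervals = -(-duration_ms // interval_ms)  # ceiling division
    counts = [0] * num_intervals
    for t in miss_times_ms:
        counts[t // interval_ms] += 1
    return counts

def summarize(counts):
    """Maximum, mean, and population variance of the per-interval counts."""
    return {
        "max": max(counts),
        "mean": mean(counts),
        "variance": pvariance(counts),
    }

counts = misses_per_interval([20, 70, 250], duration_ms=400)
print(counts)             # [2, 0, 1, 0]
print(summarize(counts))  # {'max': 2, 'mean': 0.75, 'variance': 0.6875}
```

Setting interval_ms to 1000 gives per-second miss counts of the kind plotted in Figure 3.9, whose maximum is the worst-case load the classification engine must sustain.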
[Figure 3.7: Cache hit rates as a function of memory, M. The Bell trace is on the left, and the OGI trace is on the right. Each panel plots hit rate (%) against amount of cache memory (1000 to 100,000 bytes, logarithmic axis) for the digest cache (4-way associative), the Bloom cache (cold), the Bloom cache (double buffered), the exact cache (IPv4, 4-way associative), the exact cache (IPv6, 4-way associative), and the perfect cache.]
[Figure 3.8: Variance of cache misses as a function of memory, M (aggregated over 100 ms time scales). The Bell trace is on the left, and the OGI trace is on the right. Each panel plots the variance of misses (logarithmic axis) against amount of cache memory (1000 to 100,000 bytes, logarithmic axis) for the digest cache (4-way associative), the Bloom cache (cold), the Bloom cache (double buffered), the exact cache (IPv4, 4-way associative), and the exact cache (IPv6, 4-way associative).]
[Figure 3.9: Cache miss rates aggregated over one-second intervals, using a 2600-byte four-way set-associative digest cache. The Bell trace gave a 95.9% hit rate, while the OGI trace achieved a 97.6% hit rate. Two panels (Bell trace and OGI trace) plot the number of cache misses over 1-second intervals against time since start of trace (0 to 3500 seconds).]

3.4 HARDWARE OVERHEAD

A preliminary implementation on Intel’s IXP1200 Network Processor [28] was constructed to estimate the amount of processing overhead a cache would add. The hardware tested was an IXP1200 board, with a 200 MHz StrongARM, 6 packet-processing microengines, and 16 Ethernet ports. A simple microengine-level layer-three forwarder was implemented as a baseline measurement. A cache implementation was then grafted onto the layer-three forwarder code base. A null classifier was used, so that we could isolate the overhead associated with the cache access routines. The cache was placed into SRAM, because scratchpad memory does not have a pipelined memory access queue, and the SDRAM interface does not support atomic bit-set operations. The simulation was written entirely in microengine C and performance tests were run in a simulated virtual machine. A trie-based longest prefix match on the destination address is always performed, regardless of the outcome of the cache operation.
3.4.1 IXP Overhead

The performance of our implementation was evaluated on a simulated IXP1200 system, with 16 virtual ports. The implementation’s input buffers were kept constantly filled, and we monitored the average throughput of the system.
Table 3.3: Performance of Bloom filter caches in worst-case data flows, on a simulated IXP1200.

Number of Hash Levels     All-Miss Cache Throughput
0                         990 Mb/s
1                         868 Mb/s
2                         729 Mb/s
3                         679 Mb/s
4                         652 Mb/s
5                         498 Mb/s
The IXP1200 has a three-level memory hierarchy consisting of scratchpad, SRAM, and SDRAM, with capacities of 4 KB, 16 MB, and 256 MB, respectively. Scratchpad memory is the fastest of the three, but does not support queued memory access: subsequent scratchpad memory accesses block until the first access is complete. The IXP microcode allows for asynchronous memory access to SRAM and SDRAM. The typical register allocation scheme allows for a maximum of 32 bytes to be read per memory access. The cache implementation we constructed was designed to ensure that no flow identifier was successfully matched, and that each packet required an insertion of its flow ID into the cache. This was done so that the worst possible performance of a Bloom filter cache could be ascertained. The code was structured to disallow any shortcutting or early negative membership confirmation. The performance results of the IXP implementation are presented in Table 3.3, using a trace composed entirely of small, 64-byte packets. By comparison, a four-way set-associative digest cache was able to maintain a sustained average throughput of 803 Mb/s. The IXP is far from an ideal architecture on which to implement a Bloom filter, in large part due to its lack of small, high-speed, bit-addressable, on-chip memory. Ideally, a Bloom filter would be implemented in hardware that supports parallel access on bit-addressable memory [29]. Nevertheless, the performance results presented here serve to underscore the flexibility of our new cache design; specialized hardware is not required.
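To put the 803 Mb/s figure in context, the throughputs of Table 3.3 can be compared against it directly. The sketch below is only arithmetic on the reported numbers; treating the zero-hash-level row as the no-cache baseline is an assumption, and the names are hypothetical.

```python
# All-miss throughput in Mb/s by number of Bloom filter hash levels (Table 3.3).
BLOOM_THROUGHPUT = {0: 990, 1: 868, 2: 729, 3: 679, 4: 652, 5: 498}
DIGEST_THROUGHPUT = 803  # four-way set-associative digest cache, Mb/s

BASELINE = BLOOM_THROUGHPUT[0]  # assumed to be the forwarder without a cache

def overhead(throughput):
    """Fraction of baseline throughput lost to cache processing."""
    return 1 - throughput / BASELINE

# Largest number of hash levels that still matches the digest cache.
max_levels = max(k for k, t in BLOOM_THROUGHPUT.items() if t >= DIGEST_THROUGHPUT)

print(f"digest cache overhead: {overhead(DIGEST_THROUGHPUT):.1%}")
print(f"Bloom filter keeps pace only up to {max_levels} hash level(s)")
```

On these numbers the digest cache costs a little under a fifth of the baseline throughput, and a Bloom filter cache falls below it as soon as it uses two hash levels.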
3.4.2 Future Designs

The next-generation IXP2000 hardware will feature 2560 bytes of on-chip memory per microengine, improving access latencies by a factor of fifteen [30, 31]. Let us consider implementing a packet classification cache on this architecture. If we used this memory for an exact IPv4 cache, we would be able to store a maximum of 196 flow identifiers. An equivalent IPv6 cache would be able to store only 69 flows. Using this memory in a 32-bit four-way set-associative digest cache will allow each microengine to cache 640 flows. If we use an 8-bit four-way set-associative exact digest cache, we can use just 1 KB of on-chip memory, and 38 KB of SRAM, to store over 1000 flows per microengine.

The ability of this algorithm to reduce the amount of memory required to store a flow identifier is especially important in this architecture, because of the limited nature of memory transfer registers. Each microengine thread has access to 16 32-bit memory transfer registers, which means that fetching more than one IPv6 flow identifier requires multiple, independent memory accesses, which must be serialized. Since independent memory accesses are significantly more expensive than single, longer memory accesses, this significantly penalizes the performance of a traditional set-associative cache. Coupled with the fact that these memory accesses must be serialized (the first access must complete before the second one can be initiated), the performance benefit of avoiding SRAM memory accesses becomes overwhelmingly important.

For comparison, a modern TCAM implementation can perform 100 million lookups per second [32]. The IXP2000 can perform 233 million local memory accesses per second [31]. Without even considering the cost or power required to maintain a TCAM, a digest cache becomes a promising alternative. These arguments make our proposed techniques a prime candidate for creating efficient caches for use on future network processors.
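The capacities in this section are again plain division, as the sketch below illustrates. It assumes 13-byte IPv4 and 37-byte IPv6 flow identifiers and a 4-byte digest; the SRAM estimate additionally assumes that the mirrored exact cache stores IPv6 identifiers with a 4-bit route each, which is one accounting that lands near the quoted 38 KB. All names are hypothetical.

```python
LOCAL_MEMORY_BYTES = 2560        # on-chip memory per microengine
TRANSFER_BYTES = 16 * 32 // 8    # sixteen 32-bit transfer registers

# Assumed key sizes in bytes.
KEY_BYTES = {"exact IPv4": 13, "exact IPv6": 37, "32-bit digest": 4}

def local_capacity(key_bytes):
    """Keys that fit in one microengine's local memory."""
    return LOCAL_MEMORY_BYTES // key_bytes

def keys_per_access(key_bytes):
    """Whole keys that one transfer-register load can carry."""
    return TRANSFER_BYTES // key_bytes

for name, size in KEY_BYTES.items():
    print(f"{name}: {local_capacity(size)} flows, "
          f"{keys_per_access(size)} key(s) per memory access")

# Two-level cache: 1 KB of 8-bit digests on chip, full entries in SRAM.
two_level_flows = 1024 // 1
sram_bytes = two_level_flows * (296 + 4) // 8   # IPv6 identifier + 4-bit route
print(two_level_flows, sram_bytes / 1024)        # 1024 37.5
```

The sketch reproduces the 196, 69, and 640 flow figures, and shows that a 64-byte transfer carries only one IPv6 identifier but sixteen 32-bit digests, which is the reason a digest line can be fetched in a single access.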
3.5 CONCLUSIONS

Typical packet classification caches trade off size and performance. In this chapter, we have proposed a novel cache architecture that efficiently and effectively uses memory, given a slightly relaxed accuracy requirement. Performance of any existing flow-caching solution that employs exact caching algorithms can be improved dramatically by employing our technique, at the sacrifice of a small amount of accuracy. Our new technique is superior to previous Bloom filter approximate-caching algorithms, in both theoretical and practical performance, while also addressing the shortcomings in the previous Bloom filter cache design without introducing any additional drawbacks.
This technique can be applied to the design of a novel two-level exact cache, which can take advantage of hierarchical memory to accelerate exact caching algorithms, with strong results.
ACKNOWLEDGMENTS

We would like to thank Ed Kaiser and Chris Chambers for their comments regarding draft versions of this paper. We would also like to thank our anonymous reviewers for their feedback. The National Science Foundation under Grant EIA-0130344 and the generous donations of Intel Corporation supported this work. Any opinions, findings, or recommendations expressed are those of the author(s) and do not necessarily reflect the views of NSF or Intel.
REFERENCES

[1] F. Baboescu and G. Varghese, “Scalable packet classification,” Proceedings of ACM SIGCOMM 2001, August 2001, pp. 199–210.
[2] A. Feldmann and S. Muthukrishnan, “Tradeoffs for packet classification,” IEEE INFOCOM, 2000, pp. 1193–1202.
[3] P. Gupta and N. McKeown, “Algorithms for packet classification,” IEEE Network Special Issue, March/April 2001, 15(2), pp. 24–32.
[4] T. V. Lakshman and D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching,” Proceedings of ACM SIGCOMM 1998, August 1998, pp. 203–214.
[5] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and scalable layer four switching,” Proceedings of ACM SIGCOMM 1998, September 1998, pp. 191–202.
[6] L. Qiu, G. Varghese, and S. Suri, “Fast firewall implementations for software and hardware-based routers,” Proceedings of ACM SIGMETRICS 2001, Cambridge, Massachusetts, June 2001, pp. 344–345.
[7] C. Partridge et al., “A 50-Gb/s IP router,” IEEE/ACM Transactions on Networking, June 1998, pp. 237–248.
[8] F. Baboescu, S. Singh, and G. Varghese, “Packet classification for core routers: Is there an alternative to CAMs?” Proceedings of IEEE INFOCOM 2003, pp. 53–63.
[9] k. claffy, “Internet Traffic Characterization,” Ph.D. thesis, University of California, San Diego, 1994.
[10] R. Jain, “Characteristics of destination address locality in computer networks: A comparison of caching schemes,” Journal of Computer Networks and ISDN Systems, 18(4), May 1990, pp. 243–254.
[11] J. Xu, M. Singhal, and J. Degroat, “A novel cache architecture to support layer-four packet classification at memory access speeds,” Proceedings of IEEE INFOCOM 2000, March 2000, pp. 1445–1454.
[12] C. Huitema, IPv6: The New Internet Protocol (2nd Edition), Prentice-Hall, 1998.
[13] F. Chang, K. Li, and W. Feng, “Approximate caches for packet classification,” Proceedings of IEEE INFOCOM ’04, Hong Kong, March 2004.
[14] B. H. Bloom, “Space/time tradeoffs in hash coding with allowable errors,” Communications of the ACM, 13(7), July 1970, pp. 422–426.
[15] T. Chiueh and P. Pradhan, “High performance IP routing table lookup using CPU caching,” Proceedings of IEEE INFOCOM ’99, New York, March 1999, pp. 1421–1428.
[16] T. Chiueh and P. Pradhan, “Cache memory design for network processors,” Sixth International Symposium on High-Performance Computer Architecture (HPCA 2000), pp. 409–418.
[17] K. Gopalan and T. Chiueh, “Improving route lookup performance using network processor cache,” Proceedings of the IEEE/ACM SC 2002 Conference, pp. 1–10.
[18] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, “Longest prefix matching using Bloom filters,” Proceedings of ACM SIGCOMM ’03, August 25–29, 2003, Karlsruhe, Germany, pp. 201–212.
[19] J. Stone and C. Partridge, “When the CRC and TCP checksum disagree,” Proceedings of the ACM SIGCOMM 2000 Conference (SIGCOMM-00), August 2000, pp. 309–319.
[20] J. Saltzer, D. Reed, and D. Clark, “End-to-end arguments in system design,” ACM Transactions on Computer Systems, 2(4), 1984, pp. 277–288.
[21] K. Li, F. Chang, and W. Feng, “Architectures for packet classification,” Proceedings of the 11th IEEE International Conference on Networks (ICON 2003), pp. 111–117.
[22] C. Partridge, “Locality and route caches,” NSF Workshop on Internet Statistics Measurement and Analysis, www.caida.org/outreach/isma/9602/positions/partridge.html, 1996.
[23] “Passive Measurement and Analysis Project,” National Laboratory for Applied Network Research (NLANR), pma.nlanr.net/Traces/Traces/.
[24] C. Fraleigh, S. Moon, C. Diot, B. Lyles, and F. Tobagi, “Packet-level traffic measurements from a tier-1 IP backbone,” Sprint ATL Technical Report TR01-ATL-110101, November 2001, Burlingame, California.
[25] S. McCreary and k. claffy, “Trends in wide area IP traffic patterns: A view from Ames Internet Exchange,” ITC Specialist Seminar, Monterey, California, May 2000.
[26] FIPS 180-1, “Secure Hash Standard,” U.S. Department of Commerce/N.I.S.T., National Technical Information Service, Springfield, Virginia, April 1995.
[27] L. Carter and M. Wegman, “Universal classes of hash functions,” Journal of Computer and System Sciences, 1979, pp. 143–154.
[28] Intel IXP1200 Network Processor, www.intel.com/design/network/products/npfamily/ixp1200.htm.
[29] L. Sanchez, W. Milliken, A. Snoeren, F. Tchakountio, C. Jones, S. Kent, C. Partridge, and W. Strayer, “Hardware support for a hash-based IP traceback,” Proceedings of the 2nd DARPA Information Survivability Conference and Exposition, June 2001, pp. 146–152.
[30] D. Comer, Network Systems Design Using Network Processors, Prentice-Hall, 2003.
[31] E. Johnson and A. Kunze, IXP1200 Programming, Intel Press, 2002.
[32] SiberCore Technologies, SiberCAM Ultra-4.5M SCT4502 Product Brief, 2003.
CHAPTER 4

Towards a Flexible Network Processor Interface for RapidIO, Hypertransport, and PCI-Express

Christian Sauer
Infineon Technologies, Corporate Research, Munich, Germany

Matthias Gries, Kurt Keutzer
University of California, Electronics Research Lab, Berkeley

Jose Ignacio Gomez
Universidad Complutense de Madrid, Spain

Emerging new communication protocol standards, such as PCI Express and RapidIO, give new momentum to the still unresolved search for the right set of standardized interfaces for network processors (NPUs). Network processors are often deployed on line cards in network core and access equipment [1]. Line cards manage the connection of the equipment to the physical network. They are connected to a backplane that allows packets to be distributed to other line cards or to control processors. Due to this usage, NPUs require a number of dedicated interfaces: network interfaces connect the processor to the physical network, a switch fabric interface accesses the backplane, a control plane interface connects to an external processor that handles the control plane and also maintains the NPU and its peers on the card, memory interfaces serve packets and table lookup data, and coprocessor interfaces are used for classification, quality-of-service, and cryptography accelerators.

While some of these interfaces, for example, memory and network interfaces, are well covered and documented by different I/O standards, others, especially the switch fabric interface, still lack sufficient standardization [2]. The recently announced enhancements to packet-oriented, link-to-link communication protocols, such as RapidIO, Hypertransport, and PCI Express, are aimed at addressing this issue [3]. Initially these interfaces were developed for host-centric systems with strong support of storage semantics similar to the original PCI specification. In the network domain, we find them deployed in their traditional role, either as control-plane or coprocessor interface [4].
Recent extensions, such as peer-to-peer communication and message-passing semantics, low pin count, and scalable bandwidth, make them reasonable candidates for the switch interface of network processors. It is, however, unclear which of these interfaces should be supported by a network processor. Not only do these interfaces represent different market alliances, but they also provide (or will provide as they evolve) comparable features at a similar performance/cost ratio, as Figure 4.1 illustrates. Thus, the question arises: Can we support these interfaces with one, sufficiently flexible, solution?

Related Work. Figure 4.1 shows, besides RapidIO, Hypertransport, and PCI-Express, a number of other communication protocols. The parallel System Packet Interfaces (SPI) are commonly used to connect to external physical layer devices (PHYs) in NPUs. Current generations of network processors deploy PCI for their control plane interface. We added the newer, parallel PCI-X versions for comparison, although they are used in the PC domain. The CSIX interface used by some NPUs as switch fabric interface is a parallel interface that can be scaled up to 128 bits in chunks of 32 bits.
[Figure 4.1: Performance and pin count for common NPU interfaces. The plot shows peak data rate (Gb/s, 2 to 128) against datapath width (bits, 1 to 128) for Hypertransport, SPI-5, PCI-X 2.0, PCI Express, SPI-4.1, SPI-4.2, RapidIO (parallel), CSIX-L1, PCI-X 1.0, RapidIO (serial), and PCI 66 MHz.]
The rising interest in new interconnect standards has recently been covered by several articles [2, 5, 6]. Although in Ref. [5] some details have been compared, no comprehensive picture has been drawn of the essential elements of the interfaces. A thorough survey of I/O adapters and network technologies in general is presented in Ref. [7]. The paper particularly focuses on requirements for server I/O. We are interested in the higher-level interface and protocol aspects. To us, it is especially interesting to see how much we can gain from the fact that network processors already deal with packet-oriented communication protocols in a flexible way. In Ref. [8], an architecture is described that realizes low-performance communication interfaces in software on a specialized processor with a custom operating system. A major obstacle to a flexible solution is obviously in the different physical characteristics of the protocols. This issue, however, is already addressed by initiatives, such as the Unified 10 Gbps Physical-Layer Initiative (UXPi), and individual companies, such as Rambus [9]. Apart from comparing and analyzing three selected communication standards, the goal of this chapter is also to clarify tradeoffs involved with implementing RapidIO, Hypertransport, and PCI-Express. For this purpose, we focus on the part of single end-point interfaces that is traditionally implemented in hardware by using ASICs. This chapter is organized as follows. In the next section we introduce and compare the set of fundamental tasks performed by our interfaces. In Section 4.2, we describe the functional models of RapidIO, Hypertransport, and PCI-Express by using Click. In Section 4.3, we evaluate the feasibility of implementing the communication standards in a flexible manner using existing network processor infrastructure. We summarize and conclude in Section 4.4.
4.1 INTERFACE FUNDAMENTALS AND COMPARISON

Since all three communication protocols are packet-oriented and based on point-to-point links, their endpoint interfaces are fairly similar with respect to function and structure. In this section, we describe common concepts and identify fundamental tasks that are shared among them.
4.1.1
Functional Layers The structure of communication interfaces can be defined in three functional layers following the OSI/ISO reference model, as shown in Figure 4.2. Each layer provides a service to the above layer by using the service of the layer below. Peer layers on different nodes communicate with each other by using protocols, which
57
4
58
Towards a Flexible Network Processor Interface
Network processor core interface
Transaction layer
TA Tx
TA Mgmt
TA Rx
Data link layer
DLL Tx
DLL Mgmt
DLL Rx
Physical layer
PHY Tx
PHY Mgmt
PHY Rx
Physical interface/pins
Layered structure of our communication interfaces.
4.2 FIGURE
implies using services from the layer below. The three layers are as follows: ✦
Physical layer (PHY). Transmits data over the physical link by converting bits into electrical signals.
✦
Data link layer (DLL). Manages the direct link between peer nodes and reliably transmits pieces of information across this link.
✦
Transaction layer (TA). Establishes a communication stream between a pair of systems and controls the flow of information. Since switched networks require routing, different variants of the transaction layer might be used in endpoints and switches.
Data flows through these layers in two independent directions: the transmit path (Tx), which sends data from the device core to the link, and the receive path (Rx), which propagates data from the link to the device core. At each layer, additional control and status information for the peer layer of the opposite device is inserted (Tx side) into the data stream or data sent by the peer layer is filtered (Rx side) out of the stream by the layer management block. A transaction at the transaction layer is represented as a packet. The data link layer forwards these transaction packets and generates its own data link layer packets for state information exchange. The physical layer first converts all packets into a stream of symbols, which are then serialized to the final bitstream. The terminology of the layers follows the PCI-Express specification. RapidIO uses a similar layered structure (logical, transport, and physical).
Hypertransport, however, does not use layers explicitly, but relies on the same functions and structural elements.
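As a rough illustration of how a transaction packet is wrapped on the transmit path and unwrapped on the receive path, the following C++ sketch mirrors the three layers. The framing byte values, the one-byte sequence number, and the additive checksum are assumptions made for brevity; they merely stand in for the control symbols, sequence numbers, and CRCs of the actual protocols.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical framing markers (not taken from any specification).
constexpr std::uint8_t kStart = 0xFB;
constexpr std::uint8_t kEnd = 0xFD;

// Placeholder integrity check: an additive checksum standing in for a CRC.
std::uint8_t checksum(const Bytes& b) {
    std::uint8_t s = 0;
    for (std::uint8_t x : b) s = static_cast<std::uint8_t>(s + x);
    return s;
}

// DLL Tx: prepend a sequence number and append the checksum.
Bytes dll_tx(const Bytes& ta_packet, std::uint8_t seq) {
    Bytes out{seq};
    out.insert(out.end(), ta_packet.begin(), ta_packet.end());
    out.push_back(checksum(out));
    return out;
}

// PHY Tx: frame the link layer packet with start and end markers.
Bytes phy_tx(const Bytes& dll_packet) {
    Bytes out{kStart};
    out.insert(out.end(), dll_packet.begin(), dll_packet.end());
    out.push_back(kEnd);
    return out;
}

// Rx path: deframe, verify the checksum, and hand the payload upwards.
// Returns false on a framing or checksum error.
bool rx(const Bytes& frame, Bytes& ta_packet, std::uint8_t& seq) {
    if (frame.size() < 4 || frame.front() != kStart || frame.back() != kEnd)
        return false;
    Bytes dll(frame.begin() + 1, frame.end() - 1);
    assert(dll.size() >= 2);  // sequence number and checksum are present
    std::uint8_t sum = dll.back();
    dll.pop_back();
    if (checksum(dll) != sum) return false;
    seq = dll.front();
    ta_packet.assign(dll.begin() + 1, dll.end());
    return true;
}
```

Each layer only adds or strips its own envelope, which is the property that lets the layer management blocks be modeled independently.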
4.1.2 System Environment

The communication interface interacts with its environment in two directions: to the outside world via the physical interface and to the network processor core system via the transaction interface. Both directions can also be used to access and configure the internal state of the interface.

The physical link is formed by one or more lanes. A lane consists of two opposite unidirectional point-to-point connections for duplex communication. The bandwidth of the link can be scaled easily by changing the number of lanes (see Figure 4.1). In the case of serial protocols, such as RapidIO and PCI-Express, a lane contains only the differential data signals. A Hypertransport lane contains an explicit clock signal, and the link an additional control signal.

The transaction interface allows the NPU core either to issue a request (posted, nonposted) or to respond to an incoming request by returning a completion transaction (split transaction model). Depending on the protocol, up to four distinct address spaces (memory, IO, configuration, and message) can be defined to support different communication semantics in the system.

Each interface provides a configuration space that holds the configuration and status information of the interface. Access to this space is possible via configuration transactions from both directions. A separate internal interface to the core can be provided for lean and direct access.
4.1.3 Common Tasks

The set of elementary tasks, which are required in order to provide the different communication interfaces, is described in Table 4.1, starting from the physical layer upwards. Table 4.2 groups the tasks according to their appearance at the different layers in RapidIO, Hypertransport, and PCI-Express. Implementations of the same task may vary from interface to interface due to protocol specifics.
4.2 MODELING THE INTERFACES

In order to determine an architecture that supports our three communication protocols, we need to describe the functionality of their interfaces first. We require a purely functional and architecture-independent description that can be mapped explicitly onto different architectural approaches.
Table 4.1: Common tasks and elements.

Clock recovery. Serial interfaces encode the transmit clock into the bitstream. At the receiver, the clock needs to be recovered from data transitions in the bitstream using a Phase-Locked Loop. To establish bit lock, the transmitter sends specialized training sequences at initialization time.

Clock compensation. The PHY layer has to compensate frequency differences between the received clock and its own transmit clock to avoid clock shifts.

Lane de-skewing. Data on different lanes of a multilane link may experience different delays. This skew needs to be compensated at the receiver.

Serialization/Deserialization. The width of the internal data path has to be adjusted to the width of the lane. Serial interfaces also need to lock the bitstream onto symbol boundaries.

8b/10b coding/decoding. Data bytes are encoded into 10-bit symbols to create the bit transitions necessary for clock recovery.

Scrambling/Descrambling. Scrambling removes repetitive patterns in the bitstream. This reduces the EMI noise generation of the link.

Striping/unstriping. In multilane links, the data is distributed to/gathered from individual lanes byte-wise according to a set of alignment rules.

Framing/Deframing. Packets received at the physical layer are framed with start and end symbols. In addition, PHY layer command sequences (e.g., link training) may be added.

Cyclic redundancy check (CRC). The link layer protects data by calculation of a CRC checksum. Different CRC versions may be required depending on data type and protocol version.

Ack/Nack protocol. In order to establish a reliable communication link, the receiver acknowledges every error-free packet using a sequence number. In the case of transmission errors, a not-acknowledge is returned and the packet is retransmitted.

Classification. The classification according to packet types may be based on multiple bit fields (e.g., address, format, type) of the header.

Packet assembly/disassembly. The transaction layer assembles payload and header and forms outgoing packets according to the transaction type. The link layer may add an additional envelope (e.g., CRC, sequence number). The link layer may generate information packets (e.g., Ack/Nack, flow control) to update the link status of its peer.

Flow control. A transaction is transmitted to the receiver only if sufficient buffer space is available at the receiver. The receiver updates the transmitter periodically with the amount of available buffer space.

Address validation. Incoming transactions should be addressed only to the device or its memory spaces, respectively.

Buffers and scheduling. At least one¹ set of individually flow-controlled buffers is required for all transaction types (posted, nonposted, and completion) to prevent head of line blocking.

Configuration space. The configuration space stores the identity of the device that is determined during the initialization phase and the negotiated link parameter. It also allows access to internal state and error logs.

¹ Devices may provide additional sets for quality-of-service purposes.
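To make one of these tasks concrete, the following C++ sketch shows byte-wise striping and unstriping over a multilane link. It is a minimal sketch that assumes plain round-robin distribution; the alignment and padding rules that the individual specifications add are not modeled, and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Lane = std::vector<std::uint8_t>;

// Distribute a byte stream round-robin over `lanes` lanes.
std::vector<Lane> stripe(const std::vector<std::uint8_t>& data, std::size_t lanes) {
    assert(lanes > 0);  // a link has at least one lane
    std::vector<Lane> out(lanes);
    for (std::size_t i = 0; i < data.size(); ++i)
        out[i % lanes].push_back(data[i]);
    return out;
}

// Gather the bytes back into their original order.
// Expects lanes that were produced by stripe().
std::vector<std::uint8_t> unstripe(const std::vector<Lane>& lanes) {
    std::vector<std::uint8_t> out;
    if (lanes.empty()) return out;
    std::size_t total = 0;
    for (const Lane& l : lanes) total += l.size();
    out.reserve(total);
    for (std::size_t i = 0; i < total; ++i)
        out.push_back(lanes[i % lanes.size()][i / lanes.size()]);
    return out;
}
```

Byte i travels on lane i mod N at position i div N, so the receiver can restore the order without any extra metadata once the lanes are de-skewed.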
Table 4.2: Tasks on protocol layers for the different interfaces.

                          RapidIO           PCI-Express       Hypertransport
Function                  PHY   DLL   TA    PHY   DLL   TA    PHY   DLL   TA
Clock recovery            +     −     −     +     −     −     −     −     −
Clock compensation        +     −     −     +     −     −     +     −     −
Lane de-skewing¹          (+)   −     −     (+)   −     −     (+)   −     −
8b/10b coding             +     −     −     +     −     −     −     −     −
Scrambling                −     −     −     +     −     −     −     −     −
Striping¹                 (+)   −     −     (+)   −     −     (+)   −     −
Framing                   +     −     −     +     −     −     −     −     −
CRC protection            −     +     −     −     +     (+)²  −     +³    −
Ack/Nack protocol         −     +     −     −     +     −     −     −     −
Classification            −     +     +     −     +     +     −     +     +
Packet assembly           −     +     +     −     +     +     −     +     +
Flow control              −     −     +     −     −     +     −     −     +
Address validation        −     −     +     −     −     +     −     −     +
Buffers and scheduling    −     −     +     −     −     +     −     −     +
Configuration space       −     −     +     −     −     +     −     −     +

¹ Required only for multilane links.
² Optional end-to-end CRC for transactions.
³ Periodic instead of per-packet CRC.
In this section, we first discuss the use of Click for the functional description. Then, we present the individual models and discuss important communication scenarios. Our models capture the complete data and control flow for the steady state of a single-link end device. Initialization, configuration, and status reporting are simplified. Since we are not interested in physical properties, we do not model clock recovery, or any synchronization on the physical layer. We verify our models by simulation using communication patterns that have been derived manually from the specifications.
4.2.1 Click for Packet-Based Interfaces

We implement our functional interface models in Click, a domain-specific framework for describing network applications [10]. We have chosen Click for several reasons: Click models are executable, implementation-independent, and capture inherent parallelism and dependencies among elements. Furthermore, Click’s abstraction level and the extensible element library allow us to focus on interface specifics.

In Click, applications are composed in a domain-specific language from elements that can be linked by directed connections. The elements, written in C++, describe common computational network operations, whereas connections specify the flow of packets between elements. Packets are the only data type that can be communicated. All application state is kept local within elements. Two patterns of packet communication are distinguished in Click: push and pull. Push communication is initiated by a source element and models the arrival of packets into the system. Pull communication is initiated by a sink and models space that becomes available in an outbound resource.

Click was originally implemented on Linux using C++. Recent extensions to Click include Refs. [11] and [12]. In Ref. [11], a multithreaded Linux implementation is shown that exploits Click’s parallelism in processing packet flows. In Ref. [12], Shah et al. show how Click, augmented with some abstracted architectural features, can be used as a programming model for network processors. This will become important as soon as we map our functional models onto a target architecture.

To use Click for our purposes, a number of issues had to be addressed and resolved:
✦ Flow of control information. In some packet flows, state information generated downstream needs to be fed back into an upstream element. To achieve the proper granularity, we explicitly model such dependencies using tokens. In push connections a token represents the state change event. A pull token indicates reading state access.

✦ Nonpacket data types. Besides transaction and link layer packets, state tokens and symbols are used. Both of them are also represented internally by Click packets. Elements may convert packets into tokens or symbols and vice versa.

✦ Multirate elements. Interfaces require elements with different input and output rates. The Framer, for instance, converts incoming link layer packets into a sequence of symbols.

4.2.2 PCI Express

PCI-Express is a serial, packet-oriented, point-to-point data transfer protocol [13]. There are two different versions of PCI-Express: Base and Advanced Switching. The Base specification [14] preserves the software interface of earlier PCI
versions. The Advanced Switching version [15] will define a different transaction layer than the base version to add features important to the networking domain, such as protocol encapsulation, multicast, and peer-to-peer communication. For the purpose of this chapter, we use the Base specification [14] to model the critical path of an endpoint device. The Click diagram for our implementation is shown in Figure 4.3. The functionality of the elements has been described earlier in Table 4.1. Based on the model, six cases (A–F) of information flow through the interface can be identified:
1. Outbound transactions. The network processor core initiates a transaction (e.g., a read request) by transferring data and parameters into the transaction buffer, which is part of the Tx transaction layer (TaFlTx). The buffer implements at least three queues to distinguish between posted, nonposted, and completion transactions, which represents one virtual channel. From the buffer, transactions are forwarded to the data link layer depending on the priority and the availability of buffer space at the receiver side of the link. When a transaction leaves, flow control counters are updated. The data link layer (AckNackTx) adds a sequence number, encapsulates the transaction packet with a cyclic redundancy check (CRC), stores a copy in the replay buffer, and forwards it to the physical layer. At the PHY layer, the packet is framed and converted into a stream of symbols. The symbols are, if necessary, distributed onto multiple lanes, encoded and serialized into a bitstream before they are transferred to the channel. The serialization is not modeled.

2. Inbound transactions. A stream of encoded symbols enters the receive side of the PHY layer and is decoded, assuming that clock recovery, compensation, lane de-skewing, and de-serialization have already been performed. The Deframer detects and assembles symbol sequences to PHY layer commands and packets. Packets are forwarded to the data link layer. The DLL classifies incoming packets into transaction packets, link layer packets, and erroneous packets. Transaction packets that pass the CRC and have a valid sequence number are forwarded to the transaction layer (AckNackRx). Erroneous packets are discarded. For each received transaction an acknowledge or not-acknowledge response is scheduled. At the transaction layer, the received transaction (e.g., a read completion) is stored into the appropriate receive buffer queue and the network processor core is notified. As soon as the transaction is pulled from the queue, the receive flow control counters can be updated, and the transfer is completed.

3. Outbound acknowledge packets. The data link layer generates confirmation packets to acknowledge/not acknowledge the reception of transaction packets (AckNackRx). To preserve bandwidth, these packets are issued in scheduled intervals rather than after every received packet. Besides the ack/nack type, a packet contains the last valid sequence number and is CRC-protected.

4. Inbound acknowledge packets. If the received link layer packet has a valid CRC and is an ack/nack, its sequence number SN is verified (AckNackTx). In case a valid acknowledge has been received, all transactions with sequence numbers not larger than SN can be purged from the replay buffer. Otherwise, transactions with larger numbers are retransmitted. If there were too many retransmissions (four) or no ack/nack packet was received, a link retraining command would be issued to the PHY layer.

5. Outbound flow control packets. After having read a transaction from the receive buffer and changed the receive flow control counters, the transmitter has to be updated. For this purpose, the link layer issues a flow update packet that is generated from the counter values provided by the TA layer. In the initialization state, init packets instead of updates are issued to the transmitter.

6. Inbound flow control packets. Inbound flow control packets are forwarded by the receiving link layer to the transaction layer. The TA layer updates its transmit flow control counters and schedules the next packet for transmission from the pending transaction buffer.

[Figure 4.3: PCI-Express end-point device interface. Click diagram spanning the network processor core, the transaction layer (TaFlTx, TaFlRx, flow control counters, address validation), the data link layer (AckNackTx, AckNackRx, ReplayBuf, sequence number and CRC elements), the physical layer (Framer, Deframer, Scramble, 8b10b, Serialize/Deserialize), and the physical channel.]
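The replay buffer handling of cases 3 and 4 can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the AckNackTx element of our model: sequence numbers are assumed to grow without wrap-around, an acknowledge is assumed to reset the retransmission count, and the retraining limit is assumed to be reached on the fourth consecutive not-acknowledge. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

constexpr unsigned kMaxReplays = 4;  // retransmission limit named in the text

struct ReplayBuffer {
    struct Entry {
        std::uint32_t seq;
        std::vector<std::uint8_t> packet;
    };
    std::deque<Entry> pending;  // unacknowledged packets, oldest first
    unsigned replays = 0;       // consecutive retransmission rounds

    // Keep a copy of every transmitted transaction packet.
    void store(std::uint32_t seq, std::vector<std::uint8_t> packet) {
        assert(pending.empty() || pending.back().seq < seq);  // no wrap-around
        pending.push_back({seq, std::move(packet)});
    }

    // Drop all packets with sequence numbers not larger than sn.
    void purge(std::uint32_t sn) {
        while (!pending.empty() && pending.front().seq <= sn) pending.pop_front();
    }

    void on_ack(std::uint32_t sn) {
        purge(sn);
        replays = 0;
    }

    // Returns the sequence numbers to retransmit. `retrain` is set when the
    // retransmission limit is reached and the link should be retrained.
    std::vector<std::uint32_t> on_nack(std::uint32_t sn, bool& retrain) {
        purge(sn);
        std::vector<std::uint32_t> resend;
        retrain = (++replays >= kMaxReplays);
        if (!retrain)
            for (const Entry& e : pending) resend.push_back(e.seq);
        return resend;
    }
};
```

Because a single acknowledge covers every packet up to SN, the buffer only needs to be scanned from its oldest entry, which keeps the purge cost proportional to the number of packets released.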
4.2.3 RapidIO

RapidIO is a packet-oriented, point-to-point data transfer protocol. Like PCI-Express, RapidIO is a layered architecture [16]. The logical layer (our transaction layer) specifies the transaction models of RapidIO, that is, I/O, message passing, and global shared memory. The transport layer specifies the routing of packets through the network (our transaction layer covers the part which concerns end devices: the device identification). The physical layer defines the interface between two devices (our physical layer) as well as the packet transport and flow control mechanisms (our data link layer).

Since there are only minor differences between PCI-Express and RapidIO, we refrain from presenting the complete implementation here. Instead, we list the differences of each layer. The transport layer implements a different buffer scheme with four prioritized transaction queues that are jointly flow-controlled. At the link layer, an explicit acknowledge for each packet is required, whereas PCI-Express allows the acknowledge of a packet sequence. The not-acknowledge provides the cause of an error that is used for individual reactions at the transmitter. The PHY layer uses slightly different control symbols than PCI-Express.
4.2.4 Hypertransport

Hypertransport is a parallel, packet-oriented, point-to-point data transfer protocol for chip-to-chip links [17, 18]. The most recent update [17] of the specification extends the protocol with communication system-specific features, such as link-level error recovery, message passing semantics, and direct peer-to-peer transfer. In this chapter, we primarily use the preceding version described in Ref. [18].

Unlike RapidIO and PCI-Express, the packet transfer portion of a link comprises groups of parallel, unidirectional data signals with explicit clocks and an additional sideband signal to separate control from data packets. Control packets are used to exchange information, including the request and response transactions, between the two communicating nodes. Data packets that just carry the raw payload are always associated with a leading control packet. To improve the information exchange, the transmitter can insert certain independent control packets into a long data transfer.

The base functionality of the Hypertransport protocol is comparable to PCI-Express and RapidIO. However, Table 4.2 reveals two main differences: (1) At the PHY layer, Hypertransport does not require framing, channel coding, and clock recovery due to the parallel interface; and (2) in nonretry mode,¹ there is no acknowledge/not-acknowledge protocol at the link layer and a periodic CRC inserted every 512 transferred bytes is used.

The Click implementation of an interface for a single-link end device is shown in Figure 4.4. In our model, we partition the Hypertransport protocol logically among our protocol layers as defined in Section 4.1.1, although layers are not used by the specification. Due to the absence of an ack/nack protocol, only four paths through the interface are important:

1. Outbound transactions. Similar to PCI-Express, a transaction is written into one of the transaction buffer queues (posted, nonposted, response). Transactions are forwarded depending on the priority and the availability of receiver space (flow control). When a transaction leaves, flow control counters are updated. The link layer performs the periodic CRC. The CRC value is inserted into the packet stream every 512 bytes. If necessary, the data link layer would interleave the current outgoing transaction with control packets, for example, for flow control. If there are neither waiting transactions nor control packets, idle control packets are issued to the PHY layer. At the PHY layer, the packet is serialized and distributed according to the link width.

2. Inbound transactions. The PHY layer de-serializes the stream of incoming data into four-byte fragments, colors them as control or data, and sends them to the link layer. At the link layer, the CRC check is performed and the fragments are assembled to packets. After a classification step, transaction packets are passed on to the next layer. At the TA layer, the address check is performed that discards invalid transactions. Valid transactions are stored in the appropriate receive buffer queue, and the network processor core is notified. When the transaction is pulled from the queue, the receive flow control counters are updated and the transfer is completed.

3. Outbound flow control packets. The link layer issues so-called NOP packets that include flow control information provided by the transaction layer. In Hypertransport, only the counter differences (maximum two bits per flow counter) are transferred. During initialization, multiple packets are therefore necessary to transfer the absolute value of the receive buffer.

4. Inbound flow control packets. Inbound flow control information is forwarded by the receiving link layer to the transaction layer. The transaction layer updates its transmit flow control counters and schedules the next packet for transmission from the pending transaction buffer.

[Figure 4.4: Hypertransport end-point device interface. Click diagram spanning the network processor core, the transaction layer (TaFlTx, TaRx queues for posted, non-posted, and response transactions, flow control counters, Flow Mngr, address validation), the data link layer (Classify, Pack_Data, CRC32 insert, ChCRC32, Timed Fl Gen, priority schedulers), the physical layer (Serialize/DeSerialize), and the physical channel. Paint: 1 = commands, 0 = data.]

¹ In retry mode [17] the link layer is more similar to PCI-Express, using per-packet CRCs and an Ack/Nack protocol.
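The consequence of transferring only counter differences (path 3 above) can be illustrated with a small C++ sketch. It assumes that each two-bit field carries an increment between 0 and 3 and ignores how the fields of several flow counters share one NOP packet; the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr unsigned kMaxDelta = 3;  // largest value of a two-bit field

// Receiver side: split the number of free buffers into per-NOP increments.
std::vector<unsigned> nop_increments(unsigned free_buffers) {
    std::vector<unsigned> deltas;
    while (free_buffers > 0) {
        unsigned d = std::min(free_buffers, kMaxDelta);
        deltas.push_back(d);
        free_buffers -= d;
    }
    return deltas;
}

// Transmitter side: accumulate the received increments into a credit counter.
unsigned accumulate(const std::vector<unsigned>& deltas) {
    unsigned credits = 0;
    for (unsigned d : deltas) {
        assert(d <= kMaxDelta);  // every increment must fit into two bits
        credits += d;
    }
    return credits;
}
```

Under these assumptions, announcing a receive buffer with eight free entries takes three NOP packets (3 + 3 + 2), which is why initialization needs several packets per flow counter.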
4.3 ARCHITECTURE EVALUATION

In order to derive computational requirements, we describe static profiling results for two of the three discussed communication protocols. We look at PCI-Express as the most complex specification and Hypertransport, which is the least elaborate standard in our set. The goal of this section is to determine whether programmable solutions, such as existing packet-processing engines in network processors, are powerful enough to perform this task. Programmable solutions are particularly interesting for the implementation of the discussed standards since they provide us with a platform for all three protocols and allow us to adapt to late changes of the specifications.

Remember that the processing of the investigated communication protocols is a peripheral service to the main network processor, that is, this peripheral is separate from the micro-architecture of the network processor, as it is currently deployed. This section addresses the question of whether existing building block designs for network processors, such as processing engines, timers, and queue managers, can be reused in order to implement the peripheral functionality.
The assumptions on the micro-architecture of the processing engine, our mapping and implementation decisions, and our profiling procedure are described in the following subsections. We then discuss our results and reveal the sensitivity of our feasibility study on memory latency and processing engine speed.
4.3.1 Micro-Architecture Model

We describe the simplified micro-architecture model of the packet-processing engine used for our performance analysis. Since the engine is an application-specific instruction set processor targeted at network processing tasks, we need specialized instructions for bit-level masking and logical operations [19]. We also assume support for several threads of execution and a large number of general-purpose registers (GPRs) per thread. Intel’s current processing engine [20], for instance, provides 64 GPRs per thread. We therefore do not believe that the size of the register file is a limiting factor of our application. Indeed, our analysis showed fewer than 30 registers used concurrently in the worst case. We therefore do not discuss the influence of register spills in this context. The data path is assumed to be 32-bit as in all major network processors.

The size of the code memory is not taken into account as a constraint. Although the size of the code memory was a concern for the early generations of network processors (e.g., see Ref. [21]), we do not believe that this constraint still holds for current engines, which incorporate greatly increased code memory areas as a result of earlier, rather negative design experience.

Since we consider reliable protocols, the support for timers is mandatory in order to provide timed retransmissions. Hardware timers are implemented using special registers and can thus quickly be accessed.
4.3.2 Simplified Instruction Set with Timing

The following instruction classes are used in order to statically derive the execution profile of our Click elements. An application-specific, register-to-register instruction set is assumed to support bit-level masks together with logical, arithmetic, load, and store operations to quickly access and manipulate header fields.

✦ Arithmetic operations (A). Additions and subtractions take one cycle.

✦ Logical operations (L). Operations such as and, or, xor, shift, and compare (cmp) take one cycle.

✦ Data transfer operations:
  - Load word (ldr) from memory: two cycles latency from embedded RAM (the influence of this parameter together with the latency for str will be discussed in the results subsection).
  - Load immediate (ldi), move between registers (mvr): take one cycle.
  - Store word (str) to memory: three cycles latency on embedded RAM.

✦ Branch instructions (B). Two cycles latency (no consideration of delay slots).
Our Click elements are annotated with a corresponding sequence of assembler instructions that are needed to perform the task of the element. These assembler programs of course depend on particular implementation and mapping decisions, which will be described next.
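To show how an instruction histogram translates into cycles under these latencies, the following C++ sketch weights each class with its cost. The struct and function names are illustrative; the ldr and str latencies are parameters because their influence is examined separately, and mvr is assumed to be counted together with ldi.

```cpp
#include <cassert>

// Instruction counts per class for one Click element or subfunction.
struct Histogram {
    unsigned A;    // arithmetic
    unsigned L;    // logical (and, or, xor, shift, cmp)
    unsigned ldr;  // load word
    unsigned ldi;  // load immediate (and register moves)
    unsigned str;  // store word
    unsigned B;    // branch
};

// Weighted cycle count: A, L, and ldi take one cycle, branches take two,
// and the memory latencies default to two (ldr) and three (str) cycles.
unsigned cycles(const Histogram& h, unsigned ldr_lat = 2, unsigned str_lat = 3) {
    assert(ldr_lat > 0 && str_lat > 0);
    return h.A + h.L + h.ldr * ldr_lat + h.ldi + h.str * str_lat + h.B * 2;
}
```

For the worst case of hasroom() annotated in Figure 4.5 (10 add, 2 cmp and 2 and, 4 ldr, 2 str, 2 branch), `cycles(Histogram{10, 4, 4, 0, 2, 2})` evaluates to 10 + 4 + 8 + 0 + 6 + 4 = 32 cycles.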
4.3.3 Mapping and Implementation Details

Queue management. Although the Click model of computation assumes that full packets are handed over from element to element, a one-to-one implementation of this behavior will quickly overload implementations where a wide range of packet sizes must be supported. In the case of PCI-Express, packets can range from 6 to 4096 bytes. As is done for IPv4 forwarding solutions (see Refs. [22] and [23] for two examples where queuing is supported in hardware), we therefore split the packet header from the payload and store them in separate queues. The packet descriptor, including the header, can be managed in statically reserved arrays, whereas the payload queues for PCI-Express need support for segments and linked lists in order to efficiently run enqueue and dequeue operations. By contrast, the maximum size of a Hypertransport packet is only 64 bytes. Therefore, we also use statically allocated arrays to manage Hypertransport payload since the amount of possibly wasted data memory is small. All queues are stored in the data memory of the packet-processing engine. The free list for payload segments is implemented as a stack of addresses of free segments. A segment contains 64 bytes of data plus one pointer to the next segment. The free list and the queues are implemented in separate memory areas which cannot overlap, that is, the free-list memory space cannot be used by the queues in times of congestion. This separation of memory areas greatly simplifies the management of the free list.

Optimizations. Click elements mapped to the same thread of the processing engine share temporary registers, for example, a pointer to the current packet descriptor does not need to be explicitly transferred to the next Click element. Code from Click elements within push and pull chains without branches on the same thread can be concatenated so that jumps are avoided.
A natural partition would be to use one thread for receiving packets and another thread for transmissions. We will later derive that using multiple threads to hide memory latency and increase the utilization of the engine is not an option for our application domain. Finally, looking at the data transfer between the transaction layer and the application, we count only enqueue and dequeue operations for the packet descriptor at this level. Enqueue and dequeue operations for the payload can be hidden by, for instance, using a dual-port memory so that these transfers can occur concurrently to any other computations and operations on the RAM.
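The segment-based payload queue described under queue management can be sketched in C++ as follows. This is a minimal sketch, not the profiled assembler implementation: segment indices stand in for memory addresses, each enqueue handles a single segment of at most 64 bytes, and all names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSegBytes = 64;  // payload bytes per segment (from the text)
constexpr int kNil = -1;               // "no segment" marker (assumption)

struct Segment {
    std::array<std::uint8_t, kSegBytes> data{};
    int next = kNil;  // link to the next segment of the same queue
};

// Segment memory plus a free list kept as a stack of free segment indices.
class PayloadPool {
public:
    explicit PayloadPool(int n) : segs_(static_cast<std::size_t>(n)) {
        for (int i = n - 1; i >= 0; --i) free_.push_back(i);
    }
    int alloc() {  // pop a free segment, or kNil if none is left
        if (free_.empty()) return kNil;
        int s = free_.back();
        free_.pop_back();
        at(s).next = kNil;
        return s;
    }
    void release(int s) { free_.push_back(s); }
    Segment& at(int s) {
        assert(s >= 0 && static_cast<std::size_t>(s) < segs_.size());
        return segs_[static_cast<std::size_t>(s)];
    }
    std::size_t free_count() const { return free_.size(); }

private:
    std::vector<Segment> segs_;
    std::vector<int> free_;
};

// FIFO of payload segments linked through their next pointers.
struct PayloadQueue {
    int head = kNil;
    int tail = kNil;

    bool enqueue(PayloadPool& pool, const std::uint8_t* bytes, std::size_t len) {
        if (len > kSegBytes) return false;
        int s = pool.alloc();
        if (s == kNil) return false;  // pool exhausted
        std::copy(bytes, bytes + len, pool.at(s).data.begin());
        if (tail == kNil) head = s; else pool.at(tail).next = s;
        tail = s;
        return true;
    }

    // Copies one full segment (kSegBytes bytes) into `out` and frees it.
    bool dequeue(PayloadPool& pool, std::uint8_t* out) {
        if (head == kNil) return false;
        int s = head;
        std::copy(pool.at(s).data.begin(), pool.at(s).data.end(), out);
        head = pool.at(s).next;
        if (head == kNil) tail = kNil;
        pool.release(s);
        return true;
    }
};
```

Keeping the free list in its own memory area, as in the text, means that alloc() and release() never have to inspect queue state, which is what makes the stack discipline sufficient.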
4.3.4
Profiling Procedure The static profiling of our Click elements is executed as follows: Given our models in Click, which are written in C++ and thus executable, we annotate each of the Click elements with the number of assembler instructions that the described hypothetical packet-processing engine would need to execute the element. An example is given in Figure 4.5. The code excerpt is part of the transaction layer flow control element. The method push_ta() handles the transmission of transactions to the data link layer. The method hasroom() is a helper method that checks the available space at the receiver by using the provided credits. The helper method is rather small, and we thus assume that push_ta() can use hasroom() in-line. Each basic block is commented with the number of required assembler instructions, following the implementation assumptions described in the preceding subsection. For each element, we choose a worst-case execution path and use the resulting number of instructions as the annotation of this element.2 Annotations may depend on the packet length and are parameterized accordingly. We then follow the different packet-processing paths in our Click models for receiving and transmitting packets, as derived in Section 4.2. The result is a histogram of executed assembler instructions together with the number of registers used for each of these cases. The profile also represents the execution time on one processing engine if weighted with the instruction latencies introduced earlier, assuming that there is no backlog in the queues. The execution time under a certain load could also be derived by assuming defined backlog levels in each of the participating queues of the Click model. As a result, our method is an estimation of the workload based on a static analysis of the control data flow graph, extracted from the source code of the 2. The example in Figure 4.5 is listed as flow ctrl in Table 4.3.
71
4
72
Towards a Flexible Network Processor Interface
bool CLASS::hasroom(unsigned int ta, unsigned int header_size; unsigned int data_size){ if ((header_size + _credit[ta][0][0] > _credit[ta][0][1]) //_credit contains only 12 values, && _credit[ta][0][1] //i.e., offset calc. is considered with one add return false; //3 add (2 offsets), 1 cmp, 1 and, 2 ldr (from credit ),1 branch if ((data_size + _credit[ta][1][0] > _credit[ta][1][1]) && _credit[ta][1][1]) return false; //3 add, 1 cmp, 1 and 2ldr, 1 branch _credit[ta][0][0] += header_size; //2 add (one for offset), 1 str (for_credit) _credit[ta][1][0] += data_size; //2 add, 1 str return true; //overall hasroom(): worst case: check //both if’s and update credits //10 add, 2 cmp, 2 and 4 ldr, 2 str, 2 branch //Less than 10 registers //(ta, header_size, data_size, _credit, 4_credit values) } bool CLASS::push_ta ( Packet *packet) { // extract packet type and size information ... //4 add, 1 shift, 1 ldi, 3 ldr, 1 branch // posted transactions if (type == 0x40 || type == 0x60 || (type & 0x38) == 0x30) h = hasroom(0, header_size, data_size); //3 cmp, 2 or, 1 branch, hasroom() else // nonposted transactions if ((type & 0x5e) == 0x00 || (type & 0x3e) == 0x04) h = hasroom(1, header_size, data_size); //2 cmp, 1 or, 1 branch, hasroom() else // completion if ((type & 0x3e) == 0x0a) h = hasroom(2, header_size, data_size); //1 cmp, 1 branch, hasroom() else { h = false; packet-> kill (); return (true); } //Overall push_ta(): if (h) //Worst-case: completion transaction output (OUT_TA).push(packet); //4 add, 1 shift, 6 cmp, 3 or, 1 ldi, 3ldr, 4 branch, hasroom() return(h); //Less than 10 registers (packet, type, header_size, data_size, h) } //Shareable with hasroom(): header_size, data_size
Figure 4.5. Derivation of the instruction histogram for the flow control Click element.
Click models. The main goal is to identify corner cases of the design, where certain architectural choices can be excluded from further investigation, even under best-case assumptions.
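The annotation step can be sketched in a few lines of C++. This is only an illustration of how the per-basic-block comments in Figure 4.5 add up to one histogram per element; the type and function names are hypothetical, and the mapping of add to class A and of shift, cmp, and, and or to class L is an assumption inferred from the instruction-class columns of the profile tables.

```cpp
#include <cstdint>

// Instruction-class histogram with the column order of the profile tables.
// Assumption: add -> A; shift, cmp, and, or -> L; B counts branches.
struct Histogram {
    std::uint32_t A, L, ldr, ldi, str, B;
};

// Element-wise sum of two histograms (sequential composition of code paths).
Histogram operator+(const Histogram& x, const Histogram& y) {
    return Histogram{x.A + y.A, x.L + y.L, x.ldr + y.ldr,
                     x.ldi + y.ldi, x.str + y.str, x.B + y.B};
}

// Worst-case path of hasroom(): both checks are evaluated, credits are updated.
// 10 add, 2 cmp + 2 and, 4 ldr, 0 ldi, 2 str, 2 branch
Histogram hasroom_worst_case() {
    return Histogram{10, 2 + 2, 4, 0, 2, 2};
}

// Worst-case path of push_ta() without the in-lined helper (completion case).
// 4 add, 1 shift + 6 cmp + 3 or, 3 ldr, 1 ldi, 0 str, 4 branch
Histogram push_ta_own_worst_case() {
    return Histogram{4, 1 + 6 + 3, 3, 1, 0, 4};
}

// Annotation of the whole flow control element: push_ta() with hasroom() in-lined.
Histogram flow_ctrl_annotation() {
    return push_ta_own_worst_case() + hasroom_worst_case();
}
```

Under these assumptions, flow_ctrl_annotation() yields 14 A, 14 L, 7 ldr, 1 ldi, 2 str, and 6 B, which appears to match the flow ctrl entry of Table 4.3.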
4.3.5 Results

We begin with an investigation of the required number of operations for our application scenarios. We then take different design parameters into account in order to discuss the feasibility of implementing interconnect protocols on existing network engines.
Table 4.3. Profile for PCI-Express Click elements.

                                          Instruction class
Click element/subfunction                 A     L   ldr   ldi   str     B

64-byte data packet
  flow ctrl                              14    14     7     1     2     6
  Ack/Nack Tx                            69   325   145    64     6     0
  prio sched                              0     1     1     0     0     0
  framer                                  5     5    18     1     1     1
  deframer                                0    12     0     0     0     3
  check paint                             1     1     1     0     0     0
  Ack/Nack Rx                            67   322   148    64     1     1
  classify                                2     1     2     0     0     1
  flow ctrl rx                            8     8     6     3     5     0
  AckNackGen                              0     0     0     4     1     0

Ack/Nack packet-specific
  Ack/Nack Tx ack                         7     9     2     0     1     1
  Ack/Nack Tx nack                        4     9    22     1     3     1

Flow control packet-specific (in TaFl)
  flow ctrl update (Rx)                   1     4     1     4     0     0
  ctrl hasRoom (Tx)                      10     4     4     0     2     2
  ctrl newCredit (Tx)                     8     2     3     0     2     3

Common
  Calc CRC                               64   320   144    64     0     0
  descr enqueue                           2     2     0     0     4     0
  descr dequeue                           2     2     4     0     0     0
  payload enqueue                         3     3     1     1    16     0
  payload dequeue                         3     2    16     1     1     0
Profiling Click Elements. The instruction execution profiles of the major Click elements for PCI-Express, using the instruction classes introduced earlier, are listed in Table 4.3. The profiles listed in the table reflect the computation requirement for one packet. We assume that one packet must be retransmitted if a Nack packet is
received, and that the priority scheduler has to check only one queue for its size. The requirement for the scheduler listed in the table is low, since enqueue and dequeue operations are counted separately. On the other hand, the Ack/Nack Tx/Rx counts are high, since they include the calculation of the CRC. Classifications are simple, since they rely only on small bit fields for packet and transaction types and can therefore be implemented using table lookup. Although only one data segment is required in this example, the overhead for managing the payload queue compared with the descriptor queue is already visible. This is because the payload queue requires a free-list and the adjustment of pointers.

The profile for the calculation of the CRC depends on the length of the packet. Although we looked at a rather small packet size, the calculation of the CRC is by far the most dominant element. However, CRC units are usually implemented in hardware; efficient solutions exist [24] and are being used in network processors. We therefore believe that it is safe to exclude the calculation of the CRCs from further investigation in this section. The same is true for the Framer/Deframer functionality. Although much less demanding than the CRC, the Framer is still as complex as the flow-control operations, but it can easily be implemented in hardware because its functionality is very regular.

The analysis of the Hypertransport elements leads to the same conclusions, which is why we list only the major differences from the PCI-Express implementation in Table 4.4. Since 64-byte packets are the maximum length for Hypertransport, we decided to implement the payload queue as an array of 64-byte elements. This is why the requirements are slightly smaller for enqueue and dequeue operations compared with PCI-Express.

Sensitivity on RAM Access and Clock Speed.
Given the profiling results in the form of instruction histograms, we now derive execution times for different hardware scenarios by weighting the histograms with the corresponding execution latencies per instruction class. Since the interaction with the data memory is fine-grained, the RAM timing is particularly important. We therefore consider three scenarios: (1) ideal memory that returns data within one
Table 4.4. Execution profile for Hypertransport Click elements, 64-byte data packet.

                         Instruction class
Click element            A     L   ldr   ldi   str     B
flow ctrl                4     4     2     3     1     0
addr validation          1     3     1     0     0     0
payload enqueue          2     2     0     0    15     0
payload dequeue          2     2    15     0     0     0
pack data                3     9     3     0    18     0
cycle (i.e., the ldr and str latency is one cycle); (2) on-chip SRAM running at the core speed, e.g., employing DDR interfaces, with latencies of two and three clock cycles for reads and writes, respectively; and (3) off-chip memory with a ten-cycle access latency, representing off-chip SRAM and nonburst accesses. As a basis for comparison, we assume that our microengine runs at 1 GHz. We calculate the ratio of the execution time to the interarrival time of the packets at 2.0 Gb/s goodput, which is shown on the vertical axis in Figure 4.6. This ratio can be interpreted as the necessary pipeline depth of processing elements to fulfill the processing requirements under the ideal assumption that the program could be evenly distributed among the processing elements. We calculate the execution times for different packet sizes. As argued earlier, we exclude CRC calculations and Framer functionality from the analysis.

The results for PCI-Express for the most complex application scenario (reception of a data packet, case B) are plotted in Figure 4.6a. We limit the maximum packet size to 2048 bytes, which is sufficient to encapsulate and transfer IP packets. This figure clearly shows that a tight interaction with the memory is required to get a feasible solution. Because the packets are small compared with those in IPv4 forwarding, a single processing element can keep up with line speed only for packet sizes of 32 bytes or larger using on-chip RAMs. Using PCI-Express at the interface to the control plane, however, implies that these short messages do not appear frequently compared with forwarding throughput. On the other hand, if PCI-Express is used as an interface to the switch fabric, 64-byte segments are a reasonable assumption. As a rule of thumb, we see that one packet-processing engine at 1 GHz is required to cope with one lane at 2.5 Gb/s using on-chip
Figure 4.6. Relative execution time for PCI-Express, depending on (a) packet length and RAM access time, and (b) packet length and CPU speed. [Both panels plot pipeline depth against packet length in bytes, up to 2048 bytes. Curves in (a): ideal RAM, on-chip RAM, off-chip RAM. Curves in (b): 500 MHz, 1 GHz, 2 GHz.]
Table 4.5. Cycle counts for PCI-Express transfer scenarios, on-chip RAM, data packet length 64 bytes.

Case   Scenario                               Cycles
A      Transmission of data packet            123
B      Reception of a data packet             154
C      Transmission of Ack/Nack packet        41
D      Reception of Ack/Nack packet           63/107
E      Transmission of flow control packet    41
F      Reception of flow control packet       107
memories. We also recognize that the access times to the RAM must be small and that multithreading is therefore not an option for hiding latency. Using off-chip DRAM would only make matters worse; our results show that even off-chip SRAM is not a feasible option.

Figure 4.6b shows the impact of CPU speed on the pipeline depth, assuming on-chip RAM. We see that two processing elements at 500 MHz are the corner case to support 32-byte packets. A balanced partitioning of functionality onto several processing engines is not trivial and can be achieved only approximately, which degrades the performance of pipelined solutions even further. The discussion of these issues is beyond the scope of this chapter [25, 26].

For completeness, the cycle counts for all transfer scenarios for PCI-Express (see Section 4.2.2) are listed in Table 4.5 for a data packet length of 64 bytes and on-chip RAM. Cases A, B, and D depend on the data packet size. In Case D, a retransmission of one data packet is initiated if a Nack packet is received. The cycle counts for Hypertransport, for the supported cases A, B, E, and F, are within a 15 percent margin of the PCI-Express results for packet sizes up to 64 bytes. We thus refrain from discussing the results for Hypertransport in detail.
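The weighting of a histogram with per-class latencies, and the resulting pipeline-depth ratio, can be sketched as follows. This is a rough illustration, not the evaluation code behind Figure 4.6: the names are hypothetical, the one-cycle latency for the A, L, ldi, and B classes is an assumption made only for this sketch (the actual latencies are those introduced earlier in the chapter), and the interarrival time is computed from the payload length alone, ignoring protocol overhead.

```cpp
// Instruction histogram of one element or path (column order of Table 4.3).
struct Profile {
    unsigned A, L, ldr, ldi, str, B;
};

// Latency in clock cycles per instruction class.
struct Latency {
    unsigned A, L, ldr, ldi, str, B;
};

// The three RAM scenarios from the text. Only the ldr/str values come from
// the text; the value 1 for A, L, ldi and B is an assumption of this sketch.
Latency ideal_ram()    { return Latency{1, 1, 1, 1, 1, 1}; }
Latency on_chip_ram()  { return Latency{1, 1, 2, 1, 3, 1}; }
Latency off_chip_ram() { return Latency{1, 1, 10, 1, 10, 1}; }

// Execution time in cycles: histogram weighted by the class latencies.
unsigned cycles(const Profile& p, const Latency& l) {
    return p.A * l.A + p.L * l.L + p.ldr * l.ldr +
           p.ldi * l.ldi + p.str * l.str + p.B * l.B;
}

// Ratio of execution time to packet interarrival time, read as the pipeline
// depth needed if the work could be split evenly across processing elements.
double pipeline_depth(unsigned cycle_count, double clock_ghz,
                      unsigned packet_bytes, double goodput_gbps) {
    const double exec_ns = cycle_count / clock_ghz;              // cycles / GHz = ns
    const double interarrival_ns = packet_bytes * 8.0 / goodput_gbps;
    return exec_ns / interarrival_ns;
}
```

With the flow ctrl histogram {14, 14, 7, 1, 2, 6}, this sketch gives 44, 55, and 125 cycles for ideal, on-chip, and off-chip RAM. Feeding the 154 cycles of Case B from Table 4.5 into pipeline_depth() at 1 GHz, 64 bytes, and 2.0 Gb/s gives a ratio of about 0.6 under the simplified interarrival model.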
4.3.6 Discussion

Comparing our design experience [25] for IPv4 forwarding solutions with the processing demands for the investigated chip- and board-level interconnection standards, we conclude:

✦ An application-specific instruction set targeted at network processing is also beneficial and mandatory for programmable solutions in our application domain. Header field accesses appear frequently and require bit-level operations, although the number of possible header fields is considerably smaller than in the IPv4 case.

✦ A relatively small number of source and destination addresses is required to support the discussed standards. Possible paths between sources and destinations are almost static and only require few updates at run-time, for example, when a hardware failure occurs in one part of the system. That means route lookups are not a limiting factor.

✦ Possibly smaller segments and minimum packet sizes must be supported, for example, a packet can be as small as 6 bytes, roughly one-tenth of a minimum IP packet. However, queue management building blocks for free-lists and FIFOs can be reused from existing network processors, for example, the BMU and QMU blocks in Motorola’s C-Port C5 [22], to implement the peripheral functionality.

✦ The CRC has a data-dependent execution time and is an ideal candidate for a hardware implementation, as already seen in current network processors.

✦ Timers are needed to provide timed retransmissions on the transaction layer.

✦ A tight coupling between the data RAM and the processing engine is required to support minimum-size packets with a feasible pipelining depth of processing engines. This also implies, however, that multiple threads can no longer be used to hide memory latency.
4.4 CONCLUSIONS

The goal of this work is to clarify trade-offs involved with implementing the three most popular chip- and board-level interconnect standards, namely PCI-Express, Hypertransport, and RapidIO. We have identified several application scenarios in the domain of network processing for this purpose. In order to understand common functionality and properties specific to certain solutions, we have modeled the behavior in Click. We have evaluated the feasibility of implementing the communication standards on existing network processor infrastructure by a static profiling analysis. The results of our study are as follows:

✦ After adapting Click to multiple-rate processing and arbitrary data types, we believe that Click is a natural and expressive modeling formalism to specify the functionality of the investigated communication standards. Our analysis has been particularly eased by using the abstraction level of Click elements and their interaction. Apart from having an executable specification, it is a good way to graphically document communication layers.

✦ Differences in the processing of packet content can mostly be found in the data link and transaction layers, affecting traffic class distinction and flow control. The variations are rather subtle and, more importantly, are localized to Click elements, whereas the overall packet flow is the same among the interconnect standards, that is, they provide basically the same services to upper layers. This promotes the use of flexible, yet application-specific hardware to perform the packet processing.

✦ The evaluation of network processor building blocks as a programmable platform revealed the feasibility of implementing the required processing on existing flexible solutions. However, due to relatively small packet sizes compared with IPv4 forwarding, a tight memory coupling to the processing core becomes the pivotal element of performance due to frequent accesses to header fields.

✦ Future network processors will be augmented with a high number of processing elements. Our analysis shows that the interconnect protocol processing can run on a small number of these elements, thus voiding the need for application-specific integrated circuits. Existing building blocks in network processors, that is, buffer managers and hardware timers, can be reused to implement a flexible peripheral.

In conclusion, our results encourage a flexible solution for a combined high-speed serial communication interface that is based on building blocks for network processor engines.
ACKNOWLEDGMENTS

This work was supported by Infineon Technologies, the Microelectronics Advanced Research Consortium (MARCO), the Spanish government grant TIC 2002-750, and is part of the efforts of the Gigascale Systems Research Center.
REFERENCES

[1] M. Tsai, C. Kulkarni, C. Sauer, N. Shah, and K. Keutzer, “A benchmarking methodology for network processors,” Network Processor Design: Issues and Practices, volume 1, pp. 141–165, Morgan Kaufmann, 2002.
[2] I. Elhanany, K. Busch, and D. Chiou, “Switch fabric interfaces,” IEEE Computer, September 2003.
[3] R. Merritt, “Intel pushes convergence, advanced switching,” EE Times, October 2003.
[4] M. Levy, “Motorola’s embedded PowerPC story,” Microprocessor Report, August 2002.
[5] D. Bees and B. Holden, “Making interconnects more flexible,” EE Times, September 2003.
[6] N. Cravotta, “RapidIO versus Hypertransport,” EDN, June 2002.
[7] R. Recio, “Server I/O networks past, present, and future,” ACM SIGCOMM Workshop on Network-I/O Convergence, pp. 163–178, August 2003.
[8] D. Foland, “Ubicom MASI - wireless network processor,” 15th Hotchips Conference, Palo Alto, California, August 2003.
[9] R. Merritt, “Dueling I/O efforts gear up to revamp comms,” EE Times, October 2003.
[10] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. Kaashoek, “The Click modular router,” ACM Transactions on Computer Systems, 18(3), August 2000.
[11] B. Chen and R. Morris, “Flexible control of parallelism in a multiprocessor PC router,” Proceedings of the 2001 USENIX Annual Technical Conference (USENIX ’01), pp. 333–346, Boston, Massachusetts, June 2001.
[12] N. Shah, W. Plishker, and K. Keutzer, “NP-Click: A programming model for the Intel IXP1200,” Network Processor Design: Issues and Practices, volume 2, pp. 181–201, Morgan Kaufmann, 2003.
[13] R. Budruk, D. Anderson, and T. Shanley, PCI Express System Architecture, Addison-Wesley, 2003.
[14] PCI Special Interest Group, PCI Express base specification, rev. 1.0a, www.pcisig.com, April 2003.
[15] Advanced Switching Special Interest Group, PCI Express Advanced Switching Specification, Draft 1.0, www.asi-sig.org, September 2003.
[16] RapidIO Trade Association, RapidIO interconnect specification, rev. 1.2, www.rapidio.org, June 2002.
[17] Hypertransport Technology Consortium, HyperTransport I/O link specification, rev. 1.10, www.hypertransport.org, August 2003.
[18] J. Trodden and D. Anderson, HyperTransport System Architecture, Addison-Wesley, 2003.
[19] X. Nie, L. Gazsi, F. Engel, and G. Fettweis, “A new network processor architecture for high-speed communications,” Workshop on Signal Processing Systems (SiPS), pp. 548–557, October 1999.
[20] P. Chandra, S. Lakshmanamurthy, and R. Yavatkar, “Intel IXP2400 network processor: A 2nd generation Intel NPU,” Network Processor Design: Issues and Practices, volume 1, pp. 259–275, Morgan Kaufmann, 2002.
[21] Y.-D. Lin, Y.-N. Lin, S.-C. Yang, and Y.-S. Lin, “DiffServ edge routers over network processors: Implementation and evaluation,” IEEE Network, August 2003.
[22] G. Giacalone, T. Brightman, A. Brown, J. Brown, J. Farrelland, R. Fortino, T. Franco, A. Funk, K. Gillespie, E. Gould, D. Husak, E. McLellan, B. Peregoy, D. Priore, M. Sankey, P. Stropparo, and J. Wise, “A 200 MHz digital communications processor,” IEEE International Solid-State Circuits Conference (ISSCC), pp. 416–417, February 2000.
[23] G. Kornaros, I. Papaefstathiou, A. Nikologiannis, and N. Zervos, “A fully-programmable memory management system optimizing queue handling at multi gigabit rates,” 40th Conference on Design Automation (DAC), pp. 54–59, June 2003.
[24] T. Henriksson, H. Eriksson, U. Nordqvist, P. Larsson-Edefors, and D. Liu, “VLSI implementation of CRC-32 for 10 gigabit Ethernet,” 8th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 1215–1218, September 2001.
[25] C. Kulkarni, M. Gries, C. Sauer, and K. Keutzer, “Programming challenges in network processor deployment,” International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 178–187, October 2003.
[26] M. Gries, C. Kulkarni, C. Sauer, and K. Keutzer, “Exploring trade-offs in performance and programmability of processing element topologies for network processors,” Network Processor Design: Issues and Practices, volume 2, pp. 133–158, Morgan Kaufmann, 2003.
CHAPTER 5

A High-Speed, Multithreaded TCP Offload Engine for 10 Gb/s Ethernet

Yatin Hoskote, Sriram Vangal, Vasantha Erraguntla, Nitin Borkar
Microprocessor Technology Labs, Intel Corporation, Hillsboro, OR 97124, USA
Transmission Control Protocol (TCP) is a connection-oriented reliable protocol accounting for over 80 percent of network traffic. Today TCP processing is performed almost exclusively through software. Even with the advent of GHz processor speeds, there is a need for a dedicated TCP offload engine (TOE) in order to support high bandwidths of 10 Gb/s and beyond [1]. Several studies have shown that even state-of-the-art servers are forced to completely dedicate their CPUs to TCP processing when bandwidths exceed a few Gb/s [2]. At 10 Gb/s, roughly 14.88 million minimum-size Ethernet packets arrive every second, with a new packet arriving every 67.2 ns. Allowing a few nanoseconds for overhead, wire-speed TCP processing requires several hundred instructions to be executed approximately every 50 ns. Given that a majority of TCP traffic is composed of small packets [3], this is an overwhelming burden on the CPU.

A generally accepted rule of thumb for network processing is that 1 GHz of CPU processing frequency is required for a 1 Gb/s Ethernet link. For smaller packet sizes on saturated links, this requirement is often much higher [4]. Ethernet bandwidth is slated to increase at a much faster rate than the processing power of leading-edge microprocessors. Clearly, general-purpose MIPS will not be able to provide the required computing power in coming generations. One solution is to provide hardware support in the form of a TOE to offload some of this processing from the CPU. Figure 5.1 shows a computer system consisting of a processor with a memory unit, a chipset, and a network interface card (NIC). The TOE can be physically part of the processor, the chipset, or the NIC. Of the three possible options, the TOE as part of the chipset would provide better access to host memory.
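The 67.2 ns figure follows from standard Ethernet framing. The short sketch below reproduces the arithmetic; the function and constant names are illustrative only, and the wire overhead (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap on top of the 64-byte minimum frame) is stated here as the assumption behind the calculation.

```cpp
// Wire footprint of a minimum-size Ethernet frame, in bytes.
constexpr unsigned kMinFrameBytes = 64;        // minimum frame including FCS
constexpr unsigned kPreambleBytes = 8;         // preamble + start-of-frame delimiter
constexpr unsigned kInterFrameGapBytes = 12;   // mandatory idle time between frames

// Time between back-to-back minimum-size frames on a saturated link.
double min_packet_interarrival_ns(double link_gbps) {
    const unsigned wire_bits =
        (kMinFrameBytes + kPreambleBytes + kInterFrameGapBytes) * 8;   // 672 bits
    return wire_bits / link_gbps;   // bits / (Gbit/s) = ns
}

// Resulting worst-case packet rate.
double max_packets_per_second(double link_gbps) {
    return 1e9 / min_packet_interarrival_ns(link_gbps);
}
```

For a 10 Gb/s link this gives 672 bits / 10 Gb/s = 67.2 ns between packets, or about 14.88 million packets per second.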
Figure 5.1. TOE in system. [Block diagram: CPUs on the front side bus; memory bridge with TOE, memory, and graphics (AGP); I/O bridge with USB 1.1, HDD (ATA), local I/O, and PCI NIC.]
In addition to high-speed protocol processing requirements, the efficient handling of Ethernet traffic involves addressing several issues at the system level, such as transfer of payload and management of CPU interrupts [5]. This implies that a TCP offload engine is only part of the solution and that a system-level approach has to be taken to achieve significant improvements in TCP processing. These issues are described in Section 2. Our approach incorporates a high-speed processing engine with a DMA controller and other hardware assist blocks as well as system-level optimizations. A detailed architectural description of this solution is given in Section 3. Results of preliminary performance analysis to gauge the capability of this architecture are described in Section 4.

A proof-of-concept version of a processing engine that can form the core of this TCP offload solution has been developed (see Figure 5.2) [6]. The goal was to design and build an experimental chip that can handle the most stringent requirements: wire-speed inbound processing at 10 Gb/s on a saturated wire with minimum-size packets. Another priority was to ensure that the design cycle was short by keeping the design simple, flexible, and extensible. As opposed to a solution that uses a general-purpose processor dedicated to TCP processing, this
Figure 5.2. Proof-of-concept TOE chip. [Labeled blocks: PLL, Exec Core, ROM, Input seq, CLB, TCB, ROB, Send buffer. Chip characteristics: area 2.23 × 3.54 mm²; process 90 nm dual-VT CMOS; interconnect 1 poly, 7 metal; transistors 460 K; pad count 306.]
was a special-purpose processor targeted at this task. In order to adapt quickly to changing protocols, the chip was designed to be programmable. This approach also simplified the design and shortened the validation phase compared with fixed state-machine architectures. Additionally, the specialized instruction set significantly reduced the processing time per packet. The chip was architected so that the high-speed execution core can easily be scaled down without any redesign if the processing requirements in terms of Ethernet bandwidth or minimum packet size are relaxed. This core successfully demonstrated TCP input processing capability exceeding 9 Gb/s (see Figure 5.3). Key lessons from this experiment greatly influenced the architecture and design of the TOE presented here.
5.1 REQUIREMENTS ON TCP OFFLOAD SOLUTION

There are several steps in TCP termination that require improvement if future increases in bandwidth are to be handled efficiently:

1. Minimize intermediate copies of payload. Currently, the use of intermediate copies of data during both transmit and receive is a significant performance bottleneck. As shown in Figure 5.4, the data to be transmitted has to
Figure 5.3. Measured proof-of-concept chip results. [Top plot: processing rate (Gbps) against Vcc (V), from 4.4 Gbps at 0.9 V to 9.64 Gbps at 1.72 V. Bottom plot: power (W) against processing rate (Gbps), from 0.73 W at 0.9 V to 6.39 W at 1.72 V.]
be copied from the application buffer to a buffer in OS kernel space. It is then moved to buffers in the NIC before being sent out on the network. Similarly, data that is received has to be first stored in the NIC, then moved to kernel space, and finally copied into the destination application buffers. A more efficient mechanism of transferring data between application buffers and the NIC is sorely needed, both to improve performance and to reduce traffic on the front-side bus. Requiring the application to preassign buffers for data that it expects to receive would facilitate efficient data transfer.

2. Mitigate the effect of memory accesses. Processing transmit and receive requires accessing context information for each connection that may be stored in
host memory. Each memory access is an expensive operation, which can take up to 100 ns. Optimizing the TCP stack to reduce the number of memory accesses would significantly increase performance. At the same time, techniques that hide memory latency, such as multithreading, would make more efficient use of compute resources.

3. Provide quick access to state information. The context information for each TCP connection is on the order of several hundred bytes. Some method of caching the context for active connections is necessary. Studies have shown that caching context for a small number of connections is sufficient (burst mode operation) to see performance improvement [5]. Increasing the cache size beyond that does not help unless it is made large enough to hold the entire allowable number of connections. Protocol processing requires frequent and repeated access to various fields of each context. A mechanism for accessing these fields quickly and efficiently, such as fast local registers, reduces the time spent in protocol processing. In addition to context information, these registers can also be used to store intermediate results during processing.

4. Optimize instruction execution. Reducing the number of instructions to be executed by optimizing the TCP stack would go a long way toward reducing the processing time per packet. The best performance-power trade-off (MIPS/Watt) is achieved by a special-purpose high-speed engine that is targeted at this task [6]. This engine should be programmable to adapt easily to changing protocols. A specialized instruction set allows this engine to be designed in an optimal fashion, while at the same time providing instructions geared for efficient TCP processing. Examples are instructions to do hash table lookups and CAM lookups.

5. Streamline interfaces between the host, chipset, and NIC. Another source of overhead that reduces host efficiency is the communication interface between host and NIC. For instance, an interrupt-driven mechanism tends to overload the host and adversely impact other applications running on the host.

6. Provide hardware assist blocks for specific functions. Trade-offs must be made between implementation of functions in hardware or software by weighing the performance advantages against the increased power dissipation and chip size. Examples of functions that can be done by special-purpose hardware blocks are encryption and decryption, classification, and timers.

Figure 5.4. Data transfer in current systems. [Software transmit path: 1. CPU read from the application buffer, 2. CPU write to the socket buffer, 3. DMA read by the Ethernet controller. Software receive path: 1. DMA write by the Ethernet controller into the socket buffer, 2. CPU read, 3. CPU write to the application buffer. In both panels the CPUs reach host memory through the FSB and MCH, and the Ethernet controller is attached via PCI-Express.]
The following section describes the architecture of our TOE solution and how it addresses these issues.
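To make the memory-latency requirement concrete, the following sketch estimates how many packet contexts have to be in flight at once when each packet needs a given number of host memory accesses. It is a back-of-the-envelope application of Little's law (concurrency = latency × arrival rate) with a hypothetical function name; it ignores compute time and any overlap between accesses, so it is a lower bound rather than a design figure.

```cpp
#include <cmath>

// Lower bound on the number of packets that must be processed concurrently
// to keep up with the line, if every packet stalls on `accesses_per_packet`
// memory accesses of `access_ns` each and packets arrive every
// `interarrival_ns`. Compute time is ignored (assumption of this sketch).
unsigned min_contexts_in_flight(unsigned accesses_per_packet,
                                double access_ns,
                                double interarrival_ns) {
    const double stall_ns = accesses_per_packet * access_ns;
    return static_cast<unsigned>(std::ceil(stall_ns / interarrival_ns));
}
```

With the numbers quoted in this chapter (100 ns per access, 67.2 ns between minimum-size packets at 10 Gb/s), a single access per packet already requires two contexts in flight, and three accesses require five, which is one way to see why caching the connection context and multithreading both appear in the list of requirements.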
Figure 5.5. TOE architecture overview. [Blocks: host interface with CQ, EQ, and DBQ; host memory interface with memory queue; segmented TCB cache (1 MB, 1.2 GHz); engine (4.8 GHz); timer; scheduler (1.2 GHz); Tx DMA and Rx DMA; Tx queue; header and data queue; V2P; NIC interface. Data and control paths are drawn separately.]
5.2 ARCHITECTURE OF TOE SOLUTION

Many of the techniques described here have been used to good advantage in the design of network processors. However, network processors re-targeted for TCP termination would be less efficient in terms of performance per watt [7].
5.2.1 Architecture Details

A top-level architecture diagram of the proposed TOE is shown in Figure 5.5. The design provides well-defined interfaces to the NIC, host memory, and
the host CPU. The architecture uses a high-speed processing engine at its core, with interfaces to the peripheral units. A dual-frequency design is used, with the processing engine clocked several times faster (core clock) than the peripheral units (slow clock). This approach results in minimal input buffering needs, enabling wire-speed processing.

The design uses 1 MB of on-die cache to store TCP connection context, which provides temporal locality for 2 K connections, with additional contexts residing in host memory. The context is a portion of the transmission control block (TCB) that TCP is required to maintain for each connection. Caching this context on-chip is critical for 10 Gb/s performance. In addition, to avoid intermediate packet copies on receives and transmits, the design includes an integrated direct memory access (DMA) controller. This enables a low-latency transfer path and supports direct placement of data in application buffers without substantial intermediate buffering. A central scheduler provides global control to the processing engine at packet-level granularity. A transmit queue buffers NIC-bound packets.

The architecture presents three queues as a hardware mechanism to interface with the host CPU [8]. An inbound doorbell queue (DBQ) is used to initiate send (or receive) requests. An outbound completion queue (CQ) and an exception/event queue (EQ) are used to communicate processed results and events back to the host. A timer unit provides hardware offload for four of seven frequently used timers associated with TCP processing. The TOE includes hardware assist for virtual-to-physical (V2P) address translation.

The DMA engine supports four independent, concurrent channels and provides a low-latency/high-throughput path to/from memory. The TOE constructs a list of descriptors (commands for read and write), programs the DMA engine, and initiates the DMA start operation. The DMA engine transfers data from source to destination as per the list. Upon completion of the commands, the DMA engine notifies the TOE, which updates the completion queue to notify the host.

A micro-architecture block diagram of the processing engine is shown in Figure 5.6. It features a high-speed, fully pipelined ALU at its heart, communicating with a wide working register. TCB context for the currently scheduled active connection is loaded into the 512 B wide working register for processing. The execution core performs TCP processing under direction of instructions issued by the instruction cache. A control instruction is read every core cycle and loaded into the instruction register (IR). The execution core reads instructions from the IR, decodes them if necessary, and executes them every cycle. The functional units in the core include arithmetic and logic units, shifters, and comparators, all optimized for high-frequency operation. The core includes
Figure 5.6. Processing engine architecture. [Blocks: 4K thread cache; TCB cache; core Rx queue; 512 B working register; two 256 B scratch register arrays; pipelined ALU with ALU result path to CQ/EQ; 32-bit PC with next, branch, and start address inputs; decode and IR; 32K I-cache; scheduler control.]
a large register set: two 256 B register arrays that store intermediate processing results. The scheduler exercises additional control over execution flow.

In an effort to hide host and TCB memory latency and improve throughput, the engine is multithreaded. The design includes a thread cache, running at core speed, which allows intermediate architecture state to be saved and restored. The design also provides a high-bandwidth connection between the thread cache and the working register, making possible very fast and parallel transfer of thread state between the two. Thread context switches can occur during both receives and transmits and when waiting on outstanding memory requests or on pending DMA transactions. Specific multithreading details are described in a later section.

The engine features a cacheable control store, which enables only code relevant to specific TCP processing steps to be cached, with the rest of the code in host memory. A good replacement policy allows TCP code in the instruction cache to be swapped as required. This also provides flexibility and allows for easy protocol updates.
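The save-and-restore role of the thread cache can be pictured with a small software model. This is a simplified sketch and not the hardware design: the class and member names are hypothetical, the thread state is reduced to the 512 B working register (scratch registers and program counter are left out), and the figure of eight thread slots is an assumption derived from dividing the 4K thread cache shown in Figure 5.6 by the 512 B working register.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWorkingRegisterBytes = 512;   // from the text
constexpr std::size_t kThreadSlots = 8;              // assumption: 4 KB / 512 B

using Context = std::array<std::uint8_t, kWorkingRegisterBytes>;

// Toy model of a multithreaded engine: one working register holds the state
// of the running thread, the thread cache holds the state of all others.
class Engine {
public:
    // Suspend the running thread (e.g., on an outstanding memory request)
    // and resume thread `next`. Throws std::out_of_range for invalid slots.
    void switch_to(std::size_t next) {
        Context& target = thread_cache_.at(next);
        thread_cache_[current_] = working_register_;   // save current state
        working_register_ = target;                    // restore next state
        current_ = next;
    }

    Context& working_register() { return working_register_; }
    std::size_t current() const { return current_; }

private:
    Context working_register_{};
    std::array<Context, kThreadSlots> thread_cache_{};
    std::size_t current_ = 0;
};
```

In hardware, the wide connection between thread cache and working register lets both copies happen in parallel; the model above performs them one after the other and only illustrates the state movement.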
Input processing

TOE inbound packets from the NIC are buffered in the header and payload queue. A splitter parses each packet to separate the payload from the header and forwards the header to the scheduler unit. The execution engine performs a hash-based table lookup against the TCB cache and, on a hit, loads the context into the working register in the execution core. On a cache miss, the engine queues a host memory lookup. The execution engine performs the heart of TCP input processing under programmed control at high speed. The processing steps are summarized in the pipeline diagram in Figure 5.7a. The core also programs the DMA control unit and queues the receive DMA requests. Payload data is transferred from internal receive buffers to preposted locations in host memory using DMA. This low-latency DMA transfer is critical for high performance. Careful design allows the TCP processing to continue in parallel with the DMA operation. On completion of TCP processing, the context is updated with the processing results and written back to the TCB cache. The scheduler also updates the CQ with the completion descriptors and the EQ with the status of completion, which can generate a host CPU interrupt. This queuing mechanism enables events and interrupts to be coalesced for more efficient servicing by the CPU. The core also generates acknowledgement (ACK) headers as part of processing.
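The hit and miss behavior of the context lookup can be sketched as a small software model. Everything below is an illustrative assumption rather than the chip's actual organization: the names, the direct-mapped layout, and the hash function are hypothetical, and the entry count is chosen so that 2 K contexts of 512 B fill the 1 MB cache mentioned earlier.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Connection identifier: the TCP/IP 4-tuple.
struct ConnectionId {
    std::uint32_t src_ip, dst_ip;
    std::uint16_t src_port, dst_port;
    bool operator==(const ConnectionId& o) const {
        return src_ip == o.src_ip && dst_ip == o.dst_ip &&
               src_port == o.src_port && dst_port == o.dst_port;
    }
};

constexpr std::size_t kContextBytes = 512;   // width of the working register
constexpr std::size_t kEntries = 2048;       // 2 K connections
static_assert(kContextBytes * kEntries == (1u << 20), "contexts fill 1 MB");

using Context = std::array<std::uint8_t, kContextBytes>;

struct TcbEntry {
    bool valid = false;
    ConnectionId id{};
    Context context{};
};

// Hypothetical hash over the 4-tuple; any well-mixing function would do.
std::size_t hash_index(const ConnectionId& c) {
    const std::uint32_t ports =
        (static_cast<std::uint32_t>(c.src_port) << 16) | c.dst_port;
    const std::uint32_t h = c.src_ip ^ (c.dst_ip * 2654435761u) ^ ports;
    return h % kEntries;
}

// Direct-mapped model of the TCB cache (assumption of this sketch).
class TcbCache {
public:
    TcbCache() : entries_(kEntries) {}

    // Hit: pointer to the cached entry. Miss: nullptr, in which case the
    // engine would queue a host memory lookup instead.
    const TcbEntry* lookup(const ConnectionId& id) const {
        const TcbEntry& e = entries_[hash_index(id)];
        return (e.valid && e.id == id) ? &e : nullptr;
    }

    // Install or write back a context, evicting whatever shared the slot.
    void fill(const ConnectionId& id, const Context& ctx) {
        TcbEntry& e = entries_[hash_index(id)];
        e.valid = true;
        e.id = id;
        e.context = ctx;
    }

private:
    std::vector<TcbEntry> entries_;
};
```

On a hit, the returned context corresponds to what would be loaded into the working register; write-back after processing maps to another call to fill().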
Output processing
The host places doorbell descriptors in DBQ. Each doorbell contains pointers to transmit or receive descriptor buffers, which reside in host memory. The TOE is responsible for fetching the descriptors and loading them into the TCB cache. The output processing steps are summarized in the pipeline diagram in Figure 5.7b. A scheduled lookup against the local TCB identifies the connection, and the corresponding connection context is loaded into the core working register, starting core processing. The core programs the DMA control unit to queue the transmit DMA requests. This provides autonomous transfer of data from payload locations in host memory to internal transmit buffers using DMA. Processed results are written back to the TCB cache. Completion of a send is signaled by populating CQ and EQ. In addition to the generic instructions supported by this TOE (see Figure 5.8), a specialized instruction set was developed for efficient TCP processing. It includes special-purpose instructions for accelerated context lookup, loading, and write back. These instructions enable context loads and stores
from TCB cache in eight slow cycles, as well as 512 B-wide context read and write between the core and the thread cache in a single core cycle. The special-purpose instructions include single-cycle hashing, DMA transmit and receive instructions, and timer commands. Hardware assist for conversion between host and network byte order is also available. The generic instructions operate on 32-bit operands.

FIGURE 5.7. Packet processing pipeline on (a) packet receive and (b) packet transmit.
[Pipeline diagram. Stage labels in (a): process NIC descriptor, check valid/IP lookup, TCP check, read hash table, read context, work queue read, timer read, parse TCP header, TCP processing, schedule ACK, update context, post interrupt queue, TCP statistics update, DMA setup, start DMA, receive data, end DMA, wrap-up processing. Stage labels in (b): read doorbell, read TxQ, read context, read route, route valid?, TCP sequence number, generate TCP headers, retransmit timers, compute checksum, update context, update CQ/EQ, DMA setup, start DMA, transmit data, end DMA, wrap-up processing. Thread-switch trigger points are marked A–F in both panels.]

FIGURE 5.8. TOE instruction set.

General-purpose instructions:
  LOADA ← data
  MOV A → B
  AND A B → cond
  OR A B → cond
  ADD A B → C
  SUB A B → C
  CMP A B → cond
  EQUAL A B → cond
  NOT A → C
  BREQZ / BRNEQZ label
  JMP label
  SHL2 A
  NOP

Special-purpose instructions (category: instructions):
  Context access: TCBRD, TCBWR
  Hashing: HSHLKP, HSHUPDT
  Multithreading: THRDSV, THRDRST
  DMA commands: DMATX, DMARX
  Timers: TIMERRD, TIMERRW
  Network to host byte order: HTONS, HTONL, NTOHL, NTOHS
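To illustrate what the byte-order instructions (HTONS, HTONL, NTOHS, NTOHL) accomplish, the following Python sketch performs the same conversions in software. The function names are hypothetical and chosen for this example; the standard `struct` module's `!` prefix selects network (big-endian) byte order, so the results do not depend on the host CPU.

```python
import struct


def to_network_16(value):
    """Pack a 16-bit value in network byte order (the role HTONS plays)."""
    return struct.pack("!H", value)


def to_network_32(value):
    """Pack a 32-bit value in network byte order (the role HTONL plays)."""
    return struct.pack("!I", value)


def from_network_16(data):
    """Unpack a 16-bit network-order field (the role NTOHS plays)."""
    return struct.unpack("!H", data)[0]


def from_network_32(data):
    """Unpack a 32-bit network-order field (the role NTOHL plays)."""
    return struct.unpack("!I", data)[0]


port_wire = to_network_16(80)          # TCP port as it appears in the header
seq_wire = to_network_32(0x12345678)   # TCP sequence number as it appears in the header
```

TCP header fields such as ports and sequence numbers are carried in network byte order, which is why a hardware assist for this conversion is useful on every packet.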
5.2.2
TCP-Aware Hardware Multithreading and Scheduling Logic
A multithreaded architecture enables hiding of latency from memory accesses and other hardware functions and thus expedites inbound and outbound packet processing, minimizing the need for costly buffering and queuing. Hardware-assisted multithreading would enable storage of thread state in private (local) memory. True hardware multithreading takes this a step further by implementing the multiple thread mechanism entirely in hardware. A TCP-aware scheduler handles the tasks of thread suspension, scheduling, synchronizing, and save/restore of thread state and the conditions that trigger them. TCP stack analysis shows that there are a finite number of such conditions, which
could be safely moved to hardware. The motivation is to free the programmer from the responsibility of maintaining and scheduling threads and to mitigate human error. This model is thus simpler than the more common model of a programmer- or compiler-generated multithreaded code. In addition, the same code that runs on a single-threaded engine can run unmodified on this engine with greater efficiency. The overhead penalty from switching between threads is kept minimal to achieve better throughput. The architecture also provides instructions to support legacy manual multithreaded programming. Hardware multithreading is best illustrated with an example. TCP packet processing requires several memory accesses as well as synchronization points with the DMA engine that can cause the execution core to stall, while waiting for a response from such long latency operations. Six such trigger conditions are identified (A–F) in pipeline diagrams in Figure 5.7. If core processing completes prior to DMA, thread switch can occur to improve throughput. When DMA ends, the thread switches back to update the context with processed results and the updated context is written back to the TCB. Thread switches can happen on both transmit and receive processing. Unlike typical multithreading where thread switch, lock/unlock, and yield points are manually controlled, the TCP-aware scheduler controls the switching and synchronization between different threads in all the above cases. A single thread is associated with each network packet that is being processed, both incoming and outgoing. This differs from other approaches that associate threads with each task to be performed, irrespective of the packet. The scheduler spawns a thread when a packet belonging to a new connection needs to be processed. A second packet for that same connection will not be assigned a thread until the first packet is completely processed and the updated context has been written back to TCB. 
This is under the control of the scheduler. When the processing of a packet in the core is stalled, the thread state is saved in the thread cache and the scheduler will spawn a thread for a packet on a different connection. It could also wake up a thread for a previously suspended packet by restoring its state and allow it to run to completion. In this approach, the scheduler also spawns special maintenance threads for global tasks like gathering statistics on Ethernet traffic. The priority mechanism to determine which packet to schedule next is programmed into the scheduler. The scheduler has to arbitrate between events that wake up or spawn threads from the following categories (as shown in Figure 5.9):

1. New packets on fresh connections, or on existing connections with no active packets in the engine.
2. New packets on existing network connections with active packets in the engine.
3. Completion events for suspended threads.
4. Maintenance and other global events.

FIGURE 5.9. Scheduler block diagram.
[Blocks shown: completion events queue, new packets queue, maintenance events queue, control (finite state machine), and core control.]

The design provides a tightly coupled 4 KB thread cache, running at core speed, which enables intermediate architecture state to be saved and restored for each thread. This cache is 8 threads deep and 512 B wide. The width of the cache is determined by the amount of context information that needs to be saved for each packet. The depth of the cache is determined by the packet arrival and completion rates. Analysis shows that for 256-byte packets on a 10 Gb/s link for performing both receives and transmits, an 8-deep cache is sufficient because that is more than the number of packets that could be active at any point in time. The high-bandwidth connection between the thread cache and the working register ensures that the overhead penalty from thread switches is minimal. At the design frequencies shown here, the overhead penalty per switch is about 2 ns. The working register, execution core, and scratch registers are completely dedicated to the packet currently being processed. This is again different from other approaches where the resources are split up a priori and dedicated to specific threads. This ensures adequate resources for each packet without having to duplicate resources and increase engine die area. Efficient multithreading is critical to the ability of the offload engine to scale up to multigigabit Ethernet rates. The design and validation of the TOE is simpler in this approach than conventional approaches to multithreading. It also simplifies requirements on the compiler and the programming model.
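The arbitration rules in this section can be made concrete with a small software model. The sketch below is hypothetical: the class and method names are invented for illustration, and the priority order (completion events first, then eligible new packets, then maintenance) is an assumption, since the text only says that the priority mechanism is programmed into the scheduler. It does reflect two properties stated above: one thread per packet with at most one packet in flight per connection, and a thread cache that holds at most 8 saved thread states.

```python
from collections import deque

THREAD_CACHE_DEPTH = 8  # from the text: the thread cache is 8 threads deep


class Scheduler:
    """Toy model of the TCP-aware scheduler (names and priorities are assumptions)."""

    def __init__(self):
        self.completions = deque()  # connections whose stalled thread may resume
        self.new_packets = []       # (connection, packet) pairs in arrival order
        self.maintenance = deque()  # global tasks such as statistics gathering
        self.suspended = {}         # connection -> saved thread state
        self.active = set()         # connections with a packet in the engine

    def suspend(self, conn, state):
        """Save thread state when the core stalls on a long-latency operation."""
        if len(self.suspended) >= THREAD_CACHE_DEPTH:
            raise RuntimeError("thread cache full")
        self.suspended[conn] = state

    def finish(self, conn):
        """Mark a packet complete so the next packet on this connection can run."""
        self.active.discard(conn)

    def next_thread(self):
        """Pick the next thread to run, or None if nothing is runnable."""
        if self.completions:
            conn = self.completions.popleft()
            return ("resume", conn, self.suspended.pop(conn))
        for index, (conn, packet) in enumerate(self.new_packets):
            if conn not in self.active:      # skip connections with an active packet
                del self.new_packets[index]
                self.active.add(conn)
                return ("spawn", conn, packet)
        if self.maintenance:
            return ("maintenance", None, self.maintenance.popleft())
        return None


sched = Scheduler()
sched.new_packets.extend([("A", "pkt1"), ("B", "pkt2"), ("A", "pkt3")])
first = sched.next_thread()     # spawns a thread for pkt1 on connection A
second = sched.next_thread()    # spawns a thread for pkt2 on connection B
third = sched.next_thread()     # pkt3 must wait: connection A is still active
sched.suspend("A", {"pc": 42})  # pkt1 stalls, for example on a DMA
sched.completions.append("A")   # the long-latency operation completes
fourth = sched.next_thread()    # pkt1's thread is resumed
sched.finish("A")
fifth = sched.next_thread()     # now pkt3 can be scheduled
```

Because the deferral of a second packet on a busy connection is handled inside `next_thread`, code written for one packet at a time needs no explicit locks or yield points, which is the programming-model benefit the section argues for.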
5.3
PERFORMANCE ANALYSIS
The TOE described here has been architected for efficient TCP termination in the chipset. An analysis of the performance of such a system will give some indication of its capability in terms of full duplex Ethernet bandwidth for particular packet sizes. We performed such a preliminary analysis for both receive and transmit fast paths. Results for the receive path are given here. With the TOE in the chipset as shown in Figure 5.1 and a dedicated DMA engine for payload transfer, we compute the packet latency from NIC interface to CPU host. The associated individual latencies used in this analysis are shown in Table 5.1, with host memory latencies obtained from Ref. [9]. Because of the multithreaded architecture, we compute two metrics for each packet: (1) the packet latency or turnaround time per packet, which corresponds to throughput for a single-threaded design and (2) the packet throughput for a multithreaded design. The main components are as follows:

1. Instruction execution
2. Memory accesses
3. Thread-switching penalty
4. DMA data transfer
To compute instruction execution time, we take the total number of instructions executed in the fast path using the specialized instruction set to be 200, with cycles per instruction (CPI) of 1.5 and a core frequency of 4.8 GHz, which gives 200 × 1.5 / 4.8 GHz = 62.5 ns. The primary reason for the small number of instructions is the use of the specialized instruction set and the large number of local registers. The DMA transfer is done concurrently with other processing as shown in Figure 5.7. Since we are focusing on small packets, it is safe to assume that DMA setup time dominates over the data transfer time.

TABLE 5.1. Latency assumptions.
Interface            Latency
TOE – Host memory    20 ns for 1st 16 B, 5 ns for each additional transfer
TOE – TCB            4 ns for 1st 64 B, 1 ns for each additional transfer
TOE – DMA engine     10 ns for setup

How we account for memory accesses differentiates throughput from latency. The hash table entries and connection context for an incoming packet can either
be in the TCB or in host memory. We use a conservative hit rate of 50 percent in the TCB to compute the average processing time or latency for a packet. On a miss, the data is written from host memory into the TCB, requiring another TCB access.

Memory access time = TCB access time + miss rate × host memory access time
Packet latency = instruction execution + memory access + DMA setup

Table 5.2 summarizes the average latency per packet and gives an indication of the turnaround time for a single incoming packet.

TABLE 5.2. Packet latency for single thread.
Instruction execution    62.5 ns
Memory access time       195.5 ns
DMA setup                10 ns
Packet latency           268 ns
Throughput               3.7 M packets/sec

However, a multithreaded design would allow the time spent waiting for host memory to be spent on processing a different packet. Consequently, the memory access time in the equation is reduced to the time required to access the on-die TCB only. But we now need to account for the thread-switching penalty.

Switching penalty = miss rate × number of switches × cycles per switch × cycle time
Memory access time = TCB access time
Packet processing time = instruction execution + memory access + switch penalty + DMA setup

In this case, the results are as shown in Table 5.3.

TABLE 5.3. Throughput computation results for multiple threads.
Instruction execution    62.5 ns
Memory access time       38 ns
Switching penalty        5.2 ns
DMA setup                10 ns
Time per packet          115.7 ns
Throughput               8.6 M packets/sec

Assuming the time for processing transmits is similarly distributed, the TOE multiplexes between receiving and transmitting packets. For a fixed packet rate, the bandwidth it can support is proportional to the size of the packets, as shown in Figure 5.10.

FIGURE 5.10. Bandwidth vs. packet size.
[Plot of bandwidth (Gb/s, 0 to 40) against packet size (bytes, 0 to 1024) for the multithreaded and single-threaded designs.]

This analysis shows that the architecture is capable of wire-speed TCP termination at full duplex 10 Gb/s rate for packets larger than 289 bytes. A single-threaded design can achieve the same performance for packet sizes larger than 676 bytes, showing greater than two times the difference in performance.
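The arithmetic behind Tables 5.2 and 5.3 can be reproduced with a short Python script. This is a sketch under stated assumptions rather than the authors' own calculation: the host memory access time of 315 ns is not given directly in the text and is inferred here from the tabulated 195.5 ns (38 ns + 0.5 × 315 ns), and the factor of two in `min_packet_bytes` assumes that the engine splits its time evenly between receive and transmit packets.

```python
INSTRUCTIONS = 200         # fast-path instruction count (from the text)
CPI = 1.5                  # cycles per instruction (from the text)
CORE_GHZ = 4.8             # core frequency (from the text)
MISS_RATE = 0.5            # 50 percent TCB hit rate (from the text)
TCB_ACCESS_NS = 38.0       # Table 5.3
HOST_ACCESS_NS = 315.0     # assumption: inferred from 195.5 = 38 + 0.5 * 315
SWITCH_PENALTY_NS = 5.2    # Table 5.3
DMA_SETUP_NS = 10.0        # Table 5.1
LINK_GBPS = 10.0

exec_ns = INSTRUCTIONS * CPI / CORE_GHZ

# Single thread: the core waits for host memory on every TCB miss.
single_ns = exec_ns + (TCB_ACCESS_NS + MISS_RATE * HOST_ACCESS_NS) + DMA_SETUP_NS

# Multiple threads: host memory latency is hidden, but switches cost time.
multi_ns = exec_ns + TCB_ACCESS_NS + SWITCH_PENALTY_NS + DMA_SETUP_NS


def mpps(ns_per_packet):
    """Packets per second, in millions."""
    return 1e3 / ns_per_packet


def min_packet_bytes(ns_per_packet):
    """Smallest packet that sustains the link rate in both directions.

    Assumes the engine alternates between receive and transmit packets,
    so each direction is served once every 2 * ns_per_packet.
    """
    return 2 * ns_per_packet * LINK_GBPS / 8


print(f"single thread: {single_ns:.1f} ns, {mpps(single_ns):.1f} M packets/s")
print(f"multithreaded: {multi_ns:.1f} ns, {mpps(multi_ns):.1f} M packets/s")
print(f"minimum sizes: {min_packet_bytes(multi_ns):.0f} B and {min_packet_bytes(single_ns):.0f} B")
```

With these inputs the script gives 268.0 ns and 115.7 ns per packet and a multithreaded break-even size of about 289 bytes, matching the text. The single-threaded break-even comes out at about 670 bytes from the unrounded latency; the 676 bytes quoted above is what results when the rounded 3.7 M packets/s figure is used instead.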
5.4
CONCLUSIONS
This chapter has presented an architecture designed to address the most significant issues in achieving full TCP termination for multigigabit Ethernet traffic. Analysis reinforces the philosophy that such an architecture featuring a special-purpose multithreaded processing engine, optimal hardware assist blocks, and streamlined interfaces, coupled with a simplified programming model, provides the necessary performance and flexibility.
ACKNOWLEDGMENTS
The authors thank G. Regnier and D. Minturn for useful discussions and help with system interface definitions, and M. Haycock, S. Borkar, J. Rattner, and S. Pawlowski for encouragement and support.
REFERENCES
[1] L. Gwennap, “Count on TCP offload engines,” EE Times, September 17, 2001, www.eetimes.com/semi/c/ip/OEG20010917S0051.
[2] G. Regnier et al., “ETA: Experience with an Intel Xeon processor as a packet processing engine,” Proceedings 11th Symposium on High Performance Interconnects, 2003, August 2003, pp. 76–82.
[3] S. McCreary et al., “Trends in wide area IP traffic patterns - A view from Ames Internet Exchange,” ITC Specialist Seminar, Monterey, California, September 14, 2000.
[4] A. P. Foong et al., “TCP performance re-visited,” ISPASS, March 2003, pp. 70–79.
[5] K. Kant, “TCP offload performance for front-end servers,” Proceedings of GLOBECOM 2003, December 2003, San Francisco, California, pp. 3242–3247.
[6] Y. Hoskote et al., “A TCP offload accelerator for 10Gb/s Ethernet in 90-nm CMOS,” IEEE Journal of Solid-State Circuits, 38(11), November 2003, pp. 1866–1875.
[7] R. Sabhikhi, talk at HPCA9, www.cs.washington.edu/NP2/ravi.s.invited.talk.pdf, February 2003.
[8] D. Dunning et al., “The Virtual Interface Architecture,” IEEE Micro, 18(2), March-April 1998, pp. 66–76.
[9] Data sheets at www.micron.com.
6 CHAPTER
A Hardware Platform for Network Intrusion Detection and Prevention
Chris Clark,1 Wenke Lee,2 David Schimmel,1 Didier Contis,1 Mohamed Koné,2 Ashley Thomas2
1 School of Electrical and Computer Engineering
2 College of Computing
Center for Experimental Research in Computer Systems (CERCS), Georgia Institute of Technology, Atlanta, GA, USA

The current generation of network intrusion detection systems (NIDS) has several limitations on its performance and effectiveness. Many of these limits arise from some inherent problems with the traditional placement of the NIDS sensors within the network infrastructure. Sensors are typically positioned at the aggregation points between the internal and external networks and monitor traffic for a large number of internal hosts. However, there may be other external entry points that go unmonitored, such as dial-up and wide-area wireless (cellular) data connections at the end hosts. Also, a sensor at the gateway typically does not monitor traffic between internal hosts, so it cannot detect internal attacks.

The performance problems with this type of centralized NIDS placement include limited throughput and poor scalability. Recent studies [1–3] have shown that modern NIDS have difficulty dealing with high-speed network traffic. Others [4, 5] have shown how attackers can use this fact to hide their exploits by overloading an NIDS with extraneous information while executing an attack. Furthermore, centralized NIDS do not scale well as network speed and the number of attacks increase. Since network traffic is increasing faster than computer performance [6] and new attacks appear almost daily, these problems will only get worse with time. Therefore, it is important to explore different architectures for deploying intrusion detection sensors.

We suggest that, in order for a network intrusion detection system to accurately detect attacks in a large, high-speed network environment, the bulk of
analysis should be performed by distributed and collaborative network node IDS (NNIDS) running at the end hosts. Advantages of this approach over centralized analysis include a large reduction in the quantity of data to be analyzed by each IDS, the ability to analyze end-to-end encrypted traffic, the ability to adapt the analysis based on knowledge of the end system, and the capability to actively control the types and rates of traffic received and sent by a host. An NNIDS is in the unique position to prevent incoming attacks from reaching the host operating system or applications. In addition, an NNIDS can prevent outgoing attacks or quarantine an infected host to keep it from infecting other internal or external hosts. On the other hand, a distributed architecture increases the difficulty of managing the sensors and detecting distributed attacks. However, these issues have been addressed in related contexts [7–13]. Our research aims to develop NNIDS that can keep up (i.e., avoid packet drop) with the traffic rate that an end host can accept. These NNIDS should be able to reliably generate timely and accurate alerts as intrusions occur and have the intrinsic ability to scale as network infrastructure and attack sophistication evolve. Research in algorithms for attack analysis and traffic profiling is an important component of this goal. However, our current research focus is on another essential component: design and implementation of a hardware platform that enables high-speed, reliable, and scalable network intrusion detection.
6.1
DESIGN RATIONALES AND PRINCIPLES
In this section, we discuss some considerations in the design and implementation of high-speed, reliable, and scalable network intrusion detection systems.
6.1.1
Motivation for Hardware-Based NNIDS
In addition to the problems mentioned previously, centralized NIDS have other weaknesses. A common and serious issue is that they typically do not have sufficient knowledge of the network topology and which operating systems are running on the network hosts. As a consequence, the NIDS and a host might interpret the same network traffic differently. This vulnerability allows attackers to evade detection by sending attack traffic to a host that looks harmless from the perspective of the NIDS [4, 5]. In addition, NIDS generally do not have the necessary keys (or enough resources) to examine end-to-end encrypted traffic for every host. This means that data sent over protocols such as SSL or SSH cannot be analyzed by a centralized NIDS, giving attackers another means to evade detection. One remedy to these problems is to use network node IDS (NNIDS) that each monitor the traffic to a single host. An NNIDS can unambiguously analyze
the network data and have access to the key(s) to examine encrypted data. Some NNIDSs have been implemented as kernel- or application-level software. However, the overhead of intrusion detection analysis can severely degrade the performance of other applications running on the host. Furthermore, if an attacker manages to compromise the host, she can also disable the NNIDS so that all of her malicious activities will go undetected. We believe that these shortcomings can be adequately addressed by implementing the NNIDS on the network interface rather than on top of the host operating system. Network processors will be widely available and affordable in the near future and can be integrated into a network interface card (NIC) with a cost similar to other high-end NICs. Having an NNIDS run on a NIC with a network processor has several advantages over a software NNIDS. These include minimal performance impact on the host system and much stronger protection for both the host and the IDS itself. A hardware NNIDS runs independently of the host operating system and can be made “subversion-resistant” so that it continues to function even if the attached host is compromised. An attacker cannot disable the NNIDS even if he penetrates the host because the control flows to the network interface can be very restrictive. These facts make it desirable to install hardware NNIDS in critical systems or even all of the nodes on the network. This deployment scheme can scale to large and complex networks because each NNIDS runs on an affordable NIC and analyzes only the traffic for its attached node. There are other research issues with NNIDS in addition to the placement of the analysis agent. The security policy that dictates network intrusion detection functions must be managed and enforced in a distributed fashion. This problem is similar to managing distributed firewalls [9]. 
We can learn from the research in distributed firewalls to develop a (perhaps similar) solution to this problem. The NNIDS also need to perform event-sharing and collaborative analysis techniques to detect distributed attacks and share the workload when necessary. This problem is not necessarily unique to NNIDS because an NIDS using load-balancing techniques needs to deal with the same issue [14, 15]. In other words, we can borrow ideas from other research to address the issues with distributed NNIDS.
6.1.2
Characterization of NIDS Components
Before we can design and implement an NIDS on a network processor, we must first analyze the performance characteristics of NIDS analysis. A real-time NIDS monitors network traffic by sniffing (capturing) network packets and analyzing the traffic data according to intrusion-detection rules. Typically, an NIDS runs as application-level software. Network traffic data is captured using an operating
system utility, stored in OS kernel buffers, and then copied to NIDS application buffers for processing and analysis. We use Snort [16] as an example to describe the main stages of packet processing and analysis in NIDS. In the Snort software, each captured packet goes through the following steps:

1. Packet decoding. Decodes the header information at the different protocol layers and stores the information in data structures. All packets go through this step.
2. Preprocessing. Calls each preprocessor function in order, if applicable. The preprocessors used by default include IP fragment reassembly and TCP stream reassembly.
3. Detection. First, the values in a packet’s header are used to select an appropriate subset of rules for further inspection. This subset consists of all the rules that are applicable to that packet. Second, the selected rules are evaluated sequentially.
4. Decision. When there is a match with one of the detection rules, its corresponding action, alert, or logging function is carried out.

An NIDS can be considered a queuing system where the packet buffers are the queues and the NIDS is the service engine. If the NIDS processes packets more slowly than they arrive, the buffers can fill up and newly arriving packets will be dropped (i.e., not stored). If this occurs, the NIDS may not have sufficient information to accurately analyze the traffic and will fail to detect intrusions. Therefore, it is very important to design and implement NIDS to minimize (or eliminate) dropped packets. In our benchmarking experiments where Snort runs as application-level software, the service time ratios of the above steps are roughly: 3 for decoding, 10 for preprocessing, and 30 for detection. Logging can be very slow because of network or disk I/O. We also observe packet drops when the traffic rate goes above 50 Mbps. In preprocessing, the bulk of compute-time is spent on bookkeeping and thus requires frequent memory accesses.
For example, fragments of IP packets and segments of TCP streams need to be stored in data structures and looked up. In detection, the bulk of compute-time is spent on testing the conditions of the detection rules one by one. A typical NIDS can have thousands of detection rules, and each rule can have several conditions that require pattern (or keyword) matching or statistics computation. Another system factor that slows down NIDS is the inefficiency of the network data path. Packet data is captured at the network interface, passed to the kernel via the PCI bus, and filtered to eliminate unwanted packets; the remaining packets are stored in kernel buffers.
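The queuing view of an NIDS described in this section can be illustrated with a small deterministic simulation. The sketch below is a simplified model, not a measurement: it assumes evenly spaced arrivals, a single service engine, and a finite buffer, and it uses the relative stage costs quoted above (3, 10, and 30) as arbitrary time units. The function name, the buffer size, and the arrival intervals are hypothetical.

```python
from collections import deque

# Relative per-packet service times from the text (arbitrary time units).
STAGE_COST = {"decode": 3, "preprocess": 10, "detect": 30}
SERVICE_TIME = sum(STAGE_COST.values())


def count_drops(arrival_interval, service_time, buffer_size, packets):
    """Simulate a single-server queue with a finite buffer; return dropped packets."""
    in_system = deque()   # finish times of packets waiting or being serviced
    last_finish = 0
    drops = 0
    for i in range(packets):
        now = i * arrival_interval
        while in_system and in_system[0] <= now:
            in_system.popleft()              # these packets have been fully analyzed
        if len(in_system) >= buffer_size:
            drops += 1                       # buffer full: the packet is not stored
            continue
        start = max(now, last_finish)        # wait for the engine to become free
        last_finish = start + service_time
        in_system.append(last_finish)
    return drops


slow_traffic = count_drops(arrival_interval=50, service_time=SERVICE_TIME,
                           buffer_size=16, packets=1000)
fast_traffic = count_drops(arrival_interval=20, service_time=SERVICE_TIME,
                           buffer_size=16, packets=1000)
```

When packets arrive less often than once per service time, nothing is dropped; once the arrival interval falls below the service time, the buffer eventually fills and drops begin regardless of its size, which is why reducing per-packet service time matters more than adding buffer space.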
6.1.3
Hardware Architecture Considerations
It is clear from our discussion that there are potential performance gains if the NIDS components are implemented in a network processor where packet processing can take place close to the data source and can be carried out with a pipeline of processing engines. However, there are challenges in realizing these performance gains. Intrusion detection is an interesting application from an NP (network processor) hardware architecture perspective because of its substantial resource requirements. Intrusion-detection analysis requires considerably more compute cycles and memory accesses per packet than required by traditional NP applications, such as IP routing and QoS scheduling. The analysis consists of several tasks with varying resource usage patterns; some tasks are compute-bound and some are memory-bound. Furthermore, the amount of work done for each packet is not constant. When designing the NNIDS system architecture, we considered both the requirements of the various analysis tasks as well as the capabilities of each hardware component. Based on these properties and experimental testing, our goal was to determine the most efficient allocation of tasks to hardware resources. Some of these tasks fit well into existing NP architectures and some do not. Figure 6.1 summarizes our criteria for mapping tasks to hardware processing elements. On the IXP, processing requiring relatively few or simple operations to be applied to high-rate data can be implemented on the microengines. We put packet capturing and filtering, decoding, and preprocessing on the microengines. Each of these tasks naturally runs as one or more microengine
threads. Computations that require complex calculations on lower-rate data are best carried out by the StrongARM processor. We run the IDS decision engine on the StrongARM. Low-complexity tasks operating on low-rate data can be implemented in either the microengines or the StrongARM. There are some IDS tasks that require both complex computation and high throughput. This type of task is not feasible to implement on the network processor. For such cases, our approach is to map the operation onto dynamically reconfigurable hardware, which is able to achieve high performance by optimizing concurrency of the given computation. We use a field-programmable gate array (FPGA) coprocessor to handle this type of task. In our system, the coprocessor handles the keyword pattern-matching functions.

FIGURE 6.1. Task to hardware allocation.
Data rate    Low complexity               High complexity
High         Microengines                 FPGA
Low          Microengines or StrongARM    StrongARM
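The allocation criteria of Figure 6.1 amount to a two-way lookup on data rate and complexity, which the following Python sketch makes explicit. The function name and the string labels are hypothetical; the task classifications in `TASKS` follow the placements stated in the text (capture and filtering, decoding, and preprocessing on the microengines; the decision engine on the StrongARM; keyword pattern matching on the FPGA).

```python
# Figure 6.1 as a lookup table: (data rate, complexity) -> processing element.
ALLOCATION = {
    ("high", "low"): "microengines",
    ("high", "high"): "FPGA",
    ("low", "low"): "microengines or StrongARM",
    ("low", "high"): "StrongARM",
}


def allocate(data_rate, complexity):
    """Return the hardware element suggested for a task with these properties."""
    return ALLOCATION[(data_rate, complexity)]


# Task properties as implied by the placements described in the text.
TASKS = {
    "packet capture and filtering": ("high", "low"),
    "decoding": ("high", "low"),
    "preprocessing": ("high", "low"),
    "decision engine": ("low", "high"),
    "keyword pattern matching": ("high", "high"),
}

placement = {task: allocate(*properties) for task, properties in TASKS.items()}
```

Treating the mapping as data rather than code also hints at how a system could reallocate tasks if traffic conditions changed the effective data rate of a stage.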
6.2
PROTOTYPE NNIDS ON A NETWORK INTERFACE
In this section, we describe a programmable network interface and our implementation of an NNIDS on this platform.
6.2.1
Hardware Platform
A block diagram of our hardware platform is shown in Figure 6.2. It uses the RadiSys ENP-2505 development board [17] with four 100 Mbps Ethernet ports. The main computational components are an Intel IXP network processor and a Xilinx Virtex FPGA. The FPGA coprocessor board is attached to a PCI mezzanine connector (PMC) and communicates with the NP via an internal 32-bit, 66 MHz PCI bus with a theoretical throughput of 2.1 Gbps. However, the overhead imposed by the PCI interface limits the type of tasks that can be off-loaded to the coprocessor. The long latency of PCI transactions implies that large data transfers are more efficient than small transfers. The FPGA must be able to obtain a large enough compute-time improvement over the NP to justify the cost of moving the computation across the PCI bus. One task that we have successfully off-loaded to the FPGA, packet payload searching, will be discussed in Section 6.2.4. We are also pursuing a more tightly coupled NP-FPGA interface to improve performance and enable a broader class of tasks to be off-loaded to the coprocessor. This would also allow the system to adapt to changing traffic conditions by dynamically reallocating tasks between the NP and FPGA. An ideal architecture would be to have the coprocessor attached to the NP’s SRAM memory bus and mapped into the NP’s address space as shown in Figure 6.3.
FIGURE 6.2. Prototype platform.
[Block diagram. Labeled components: Ethernet, NP (StrongARM, microengines), QDR, SRAM, SDRAM, FPGA, internal PCI, PCI bridge, host PCI.]

FIGURE 6.3. Proposed platform.
[Block diagram. Labeled components: Ethernet, NP (StrongARM, microengines), QDR, SRAM, SDRAM, FPGA, internal PCI, PCI bridge, host PCI.]
This makes the cost of accessing the FPGA comparable to the cost of memory reads and writes, enabling very fine-grained partitioning of tasks between the IXP and the FPGA. This is the same type of interface specified by the Network Processor Forum’s Look Aside Interface LA-1.0 [18].
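The trade-off between PCI overhead and off-load benefit can be sketched numerically. The peak figure follows from the bus parameters given above (32 bits × 66 MHz); everything else in this Python sketch is hypothetical, including the function names, the 1 µs per-transaction latency, and the compute times, which are placeholders rather than measured values.

```python
PCI_WIDTH_BITS = 32
PCI_CLOCK_HZ = 66e6
PCI_PEAK_BPS = PCI_WIDTH_BITS * PCI_CLOCK_HZ   # 2.112e9 bits/s, the "2.1 Gbps" above


def transfer_time_s(payload_bytes, latency_s):
    """Time to move one payload across the bus: fixed latency plus streaming time."""
    return latency_s + payload_bytes * 8 / PCI_PEAK_BPS


def effective_bps(payload_bytes, latency_s):
    """Throughput actually achieved once the per-transaction latency is included."""
    return payload_bytes * 8 / transfer_time_s(payload_bytes, latency_s)


def offload_pays_off(np_time_s, fpga_time_s, payload_bytes, latency_s):
    """Off-loading helps only if FPGA compute plus transfer beats the NP alone."""
    return fpga_time_s + transfer_time_s(payload_bytes, latency_s) < np_time_s


LATENCY_S = 1e-6   # hypothetical per-transaction latency, for illustration only

small = effective_bps(64, LATENCY_S)     # small transfers are dominated by latency
large = effective_bps(1500, LATENCY_S)   # large transfers approach the peak rate
heavy_task = offload_pays_off(50e-6, 5e-6, 1500, LATENCY_S)
light_task = offload_pays_off(5e-6, 1e-6, 1500, LATENCY_S)
```

Under these placeholder numbers a computation-heavy task is worth moving to the FPGA while a light one is not, because the transfer cost alone exceeds the time the NP would need. A memory-mapped interface with access costs comparable to SRAM reads and writes would shrink the latency term and shift that boundary toward finer-grained tasks.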
Intel IXP 1200
We use the Intel IXP 1200 network processor [19] in our implementation. It is a system-on-chip containing a StrongARM core and six programmable microengines all running at a clock frequency of 232 MHz. The StrongARM runs an embedded Linux operating system. Each microengine has hardware support for multithreading, and can run a maximum of four threads. The StrongARM and all the microengines share 256 MB of 64-bit SDRAM and 8 MB of 32-bit SRAM in our configuration. The SDRAM has a peak bandwidth of 648 MBps and the SRAM has a peak bandwidth of 334 MBps.
FPGA coprocessor
Field-Programmable Gate Arrays (FPGAs) have been used to accelerate many different algorithms, often achieving several orders-of-magnitude better performance than software implementations. This is made possible by their ability to be programmed with circuits customized to the given application and their capacity to perform massively-parallel computations. Our FPGA platform consists of a board containing a Xilinx Virtex-1000 FPGA [20], which is capable of implementing circuits with the equivalent of up to one million logic gates. The FPGA has a PCI interface for I/O as well as its own dedicated high-speed SRAM.
6.2.2
Snort Hardware Implementation
We use Snort [16], a popular open-source NIDS software package, as the basis of our prototype NNIDS because it is loosely-coupled and easy to customize. Here, we briefly describe the main components of the Snort software. The packet capturing and filtering module is based on libpcap [21]. The packets are passed to the decoder to process the various packet headers. Each packet then passes through a series of preprocessors, including IP fragment reassembly and TCP stream reassembly. Then the packets are checked by the detection engine. Snort rules are organized to be matched in two phases. The first phase assigns each packet to a group based on the values of some header fields. The set of rules loaded at configuration time determines the number of groups and the header values associated with each group. The second phase performs further analysis that depends on the assigned group, but usually includes a full search of the packet payload for a large number of patterns. Finally, the decision engine uses the results of the detection phase to take appropriate action. Our task was to modify and restructure the sequential Snort software to create a multithreaded, pipelined hardware implementation. To do this, we followed two important design principles. The first principle is to intelligently structure the pipeline so that unwanted (or uninteresting) data can be filtered out as early as possible. In our design, when it is appropriate according to the site-specific configuration policy, the first phase of rule-matching is moved ahead of several preprocessors in order to reduce the amount of subsequent processing for packets that do not trigger a match. The second principle is to split a Snort module if it has multiple processing stages with very different service times. Assigning the stages to different processing engines increases packet-level parallelism in
the system. In our design, this applies to IP fragment reassembly, TCP stream reassembly, and rule checking.
[Figure 6.4: Analysis pipeline. Blocks: Receive/Filtering, IP Defrag-1, IP Defrag-2, TCP Stream-1, TCP Stream-2, Detection-1, Detection-2, Decision engine (Pass/Drop), Alert/Log. Processing elements: microengines, FPGA, StrongARM.]
Figure 6.4 shows the analysis pipeline used in our prototype NNIDS. The filtering module performs packet header based filtering. If the packet received is an IP fragment, it is enqueued for fragment reassembly. Otherwise, it is enqueued for phase one of rule checking. IP fragment reassembly is carried out by two subcomponents. Since fragments can arrive out of order, Defrag-1 reorders arriving fragments and inserts them into a linked list. Defrag-2 reassembles the fragments only when the set is complete. It also detects fragmentation anomalies such as overlapping fragments. Similarly, TCP stream reassembly is carried out by two submodules. Stream-1 validates the TCP packet and maintains session state information. Stream-2 reassembles the streams when they are complete or at intermediate points that are appropriate for the underlying application protocol. The detection module is also split into two modules. Detection-1 runs on a microengine and performs the first phase of rule checking. The most significant task in Detection-2, payload pattern-matching, requires too much computation to be run on the microengines or the StrongARM. Therefore, it is completely offloaded to the FPGA. The StrongARM uses DMA transfers to send the packets over the PCI bus to the FPGA. The FPGA compares the packet to all of the stored patterns and generates a list of pattern matches. The decision engine on the StrongARM reads the match results and determines what actions, if any, should be taken.
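The two-stage fragment handling described above can be illustrated with a minimal Python sketch. It is not the authors' microengine code: the names `Fragment`, `FragmentSet`, `defrag1_insert`, and `defrag2_reassemble` are hypothetical, and a sorted Python list stands in for the linked list mentioned in the text.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Fragment:
    offset: int           # byte offset of this fragment in the original datagram
    data: bytes
    more_fragments: bool  # IP "MF" flag: False only on the last fragment

@dataclass
class FragmentSet:
    frags: list = field(default_factory=list)  # kept sorted by offset

def defrag1_insert(fs, frag):
    """Defrag-1 (sketch): insert an arriving fragment in offset order."""
    offsets = [f.offset for f in fs.frags]
    fs.frags.insert(bisect.bisect_right(offsets, frag.offset), frag)

def defrag2_reassemble(fs):
    """Defrag-2 (sketch): return (payload, overlap_seen) once the set is
    complete, or None while fragments are still missing."""
    if not fs.frags or fs.frags[-1].more_fragments:
        return None                      # last fragment not seen yet
    payload = bytearray()
    overlap = False
    for frag in fs.frags:
        if frag.offset > len(payload):
            return None                  # gap: the set is incomplete
        if frag.offset < len(payload):
            overlap = True               # overlapping fragments are an anomaly
        payload[frag.offset:frag.offset + len(frag.data)] = frag.data
    return bytes(payload), overlap
```

Splitting the work this way mirrors the text: the insert step touches memory but does little computation, while the reassembly step performs the consistency checks.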
6.2.3
Network Interface to Host The NNIDS runs on the network interface card so that whenever the host communicates with the outside world, the traffic in both directions is analyzed.
[Figure 6.5: Network Interface with NNIDS. Blocks: Host (host kernel stack, host input FIFO, host output FIFO); IXP (StrongARM kernel stack, SA input FIFO, SA output FIFO, detection pipeline, transmit FIFO, receive firmware, split traffic, transmit firmware); Network.]
We have implemented a bidirectional path between the network and the host that is based on Ref. [22]. Figure 6.5 shows the data flow for incoming and outgoing traffic. A host device driver makes our platform function as a conventional Ethernet interface in Linux. Since the network interface is performing some TCP/IP functions that would normally be done by the host anyway, it would be possible to offload these tasks from the host by developing an interface to a higher layer on the OS network stack. A region of the IXP SDRAM is mapped to the host address space and used as a packet FIFO by the device driver to transmit outbound traffic to the IXP. Similarly, a region of host RAM is mapped to the IXP address space and used as a FIFO for inbound traffic to the host. When active response is the local policy, firmware running in the IXP will determine whether to pass or drop each packet based on the analysis output. A second network device driver is implemented to allow the StrongARM to communicate with the outside world through the network. This enables remote administrators to send control and configuration messages to the StrongARM and receive status or alert information. In our design, all connections to the
StrongARM are through this driver and treated the same. This means that a connection from the host to the StrongARM is treated the same as connections from an outside workstation, and is subject to intrusion-detection processing. Thus, even when the host is compromised, the NNIDS will continue to function because attempts to compromise the system from the host can be detected and blocked by the detection engine.
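The shared-memory packet FIFOs between the host and the IXP can be pictured as a ring of slots in a mapped region. The sketch below is only an illustration under stated assumptions: the class `PacketFifo`, the fixed slot size, and the 2-byte length prefix are hypothetical and are not taken from the actual driver, and a `bytearray` stands in for the mapped memory window.

```python
class PacketFifo:
    """Single-producer/single-consumer ring of fixed-size slots (hypothetical layout)."""

    def __init__(self, region, slot_size):
        self.region = region              # stands in for the mapped memory window
        self.slot_size = slot_size
        self.slots = len(region) // slot_size
        self.head = 0                     # next slot the consumer reads
        self.tail = 0                     # next slot the producer writes

    def push(self, packet):
        """Producer side: copy a packet into the next free slot."""
        if len(packet) > self.slot_size - 2:
            raise ValueError("packet too large for slot")
        nxt = (self.tail + 1) % self.slots
        if nxt == self.head:
            return False                  # ring is full; caller retries or drops
        base = self.tail * self.slot_size
        self.region[base:base + 2] = len(packet).to_bytes(2, "big")
        self.region[base + 2:base + 2 + len(packet)] = packet
        self.tail = nxt
        return True

    def pop(self):
        """Consumer side: return the oldest packet, or None if the ring is empty."""
        if self.head == self.tail:
            return None
        base = self.head * self.slot_size
        n = int.from_bytes(self.region[base:base + 2], "big")
        packet = bytes(self.region[base + 2:base + 2 + n])
        self.head = (self.head + 1) % self.slots
        return packet
```

One slot is left unused so that a full ring can be told apart from an empty one without a separate counter, which is a common choice when producer and consumer sit on different processors.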
6.2.4
Pattern Matching on the FPGA Coprocessor One of the most computationally-intensive tasks performed by Snort is pattern-matching on packet content [23]. Despite improved software pattern-matching algorithms [23, 24], pattern-matching is still the limiting factor in the analysis of high-speed traffic. Furthermore, the NP does not have the processing resources to handle this task. We eliminate this bottleneck by off-loading all the pattern-matching tasks to a Field-Programmable Gate Array (FPGA) coprocessor. The task of pattern-matching in NIDS consists of comparing a large number of known patterns against a stream of packets. An FPGA is well-suited for this task because it can implement thousands of pattern comparators operating in parallel. We have developed an FPGA design that compares a packet’s content against every pattern in the Snort ruleset (over 1500 patterns) simultaneously [25]. This design provides high character density and high throughput, enabling the entire ruleset to fit into a low-end FPGA device while handling up to 1 Gbps of data.
[Figure 6.6: FPGA pattern-matching coprocessor block diagram. A 32-bit input buffer feeds an 8-bit character decoder; the decoded character-match signals (a, b, c, d, e, ...) drive the pattern matchers (Rule 0 Pattern 0, Rule 0 Pattern 1, Rule 1 Pattern 0, ...), whose outputs (m0-0, m0-1, m1-0, ...) form the N-bit rule match vector (R0 to RN-1), which the output encoder packs into 32-bit words.]
A block diagram of the FPGA pattern-matching coprocessor is shown in Figure 6.6. The design is pipelined to process one character of packet data per clock cycle. An input buffer stores incoming 32-bit data words and serializes the bytes to output 8-bit characters. Next, the current character is decoded
and character-match signals are distributed to the pattern-matching units. A pattern-matching unit is instantiated for each pattern in the ruleset. The pattern matchers use a nondeterministic finite automata (NFA) technique to track matches between the input data and the stored patterns. Each pattern-matching unit has an output indicating that a complete pattern match has occurred. For rules with multiple patterns, all of the corresponding pattern-match outputs are passed through an AND gate to generate a rule match output. The rule match signals for all N rules are stored in a match vector. After the last character of a packet is processed, the output encoder packs the match results into 32-bit words and sends them to the IDS decision engine. We have developed a software tool that translates a Snort rule file into an FPGA circuit description for matching pattern strings. The circuit description is then sent over the network to the NNIDS where it is used to reconfigure the FPGA pattern-matcher. The circuit generator software supports all the standard Snort rule options for pattern-matching. An additional feature not available in Snort is approximate pattern-matching [26]. Each pattern in a Snort rule can be specified to allow a certain number of character mismatches (substitutions, insertions, or deletions) between the pattern and a packet’s content. This is useful for detecting an attack pattern that is expected to contain some variable content, but the exact variations are unknown or too numerous to list as separate patterns. It can also help detect new exploits that are similar to known exploits.
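The data path described above (one character per clock, one matcher per pattern, an AND across the patterns of a rule, and a packed match vector) can be mimicked in software. The following Python sketch is an assumption-laden analogue, not the FPGA circuit: it uses the bit-parallel shift-and algorithm as a stand-in for the hardware NFA, and the names `PatternMatcher` and `match_packet` are hypothetical. Patterns are assumed to be non-empty byte strings.

```python
class PatternMatcher:
    """Software stand-in for one pattern-matching unit (shift-and NFA simulation)."""

    def __init__(self, pattern):
        self.m = len(pattern)
        self.masks = {}                   # per-character position masks ("decoder" outputs)
        for i, c in enumerate(pattern):
            self.masks[c] = self.masks.get(c, 0) | (1 << i)
        self.state = 0                    # bit i set: pattern[0..i] matches up to here
        self.matched = False              # latched once a full match is seen

    def clock(self, ch):
        """Consume one character, as the hardware does once per clock cycle."""
        self.state = ((self.state << 1) | 1) & self.masks.get(ch, 0)
        if self.state & (1 << (self.m - 1)):
            self.matched = True

def match_packet(rules, payload):
    """rules: list of rules, each a list of byte patterns.
    Returns the rule match vector packed into 32-bit words (rule 0 in bit 0)."""
    units = [[PatternMatcher(p) for p in rule] for rule in rules]
    for ch in payload:                    # all matchers see the same character
        for rule in units:
            for unit in rule:
                unit.clock(ch)
    bits = [all(u.matched for u in rule) for rule in units]  # AND per rule
    words = []
    for i in range(0, len(bits), 32):
        word = 0
        for j, bit in enumerate(bits[i:i + 32]):
            word |= int(bit) << j
        words.append(word)
    return words
```

For example, with the hypothetical rules `[[b"abc", b"cde"], [b"e"], [b"xyz"]]` and payload `b"xxabcdexx"`, rules 0 and 1 match and rule 2 does not, so the single output word is 3. In software the inner loops run sequentially; the point of the FPGA design is that every matcher advances in the same clock cycle.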
6.2.5
Reusable IXP Libraries Programming the microengines is difficult because there is no operating system or support library. In the course of this project, we have developed a set of libraries and development tools that are essential for building NIDS on the IXP. These include a memory management library, a queue management library, a multithreaded packet capturing and filtering library, an IP fragment reassembly library, and a tool that converts standard tcpdump captures to the format used by the IXP simulator.
6.3
EVALUATION AND RESULTS We evaluated the prototype system by performing functional verification, micro-benchmarks, and system-level benchmarks. The results are presented and analyzed in this section.
6.3.1
Functional Verification In order to verify that our system produces correct results, we compared it with the standard software distribution of Snort. We attached a computer with our NNIDS and a computer running standard Snort to a network hub. We also attached another computer with traffic-generation software to the same hub. The traffic generator was used to send traffic containing a mixture of attack and nonattack traffic to the hub, allowing the traffic to be received simultaneously by both IDS computers. The output logs of each IDS sensor were compared, and we found that our system generated the same set of alerts as the standard Snort software.
6.3.2
Micro-Benchmarks For each of the NP components, we used the cycle-accurate IXP Developer’s Workbench Simulator to thoroughly test the component and measure its performance. For the FPGA pattern-matching component, we ran the test in hardware and used timers in the StrongARM to measure performance.
Receive Since there is a large overhead for processing each packet’s header, the biggest influence on receive performance is the packet size, which determines the number of packet arrivals per second. We tested this module with a range of packet sizes and determined its achievable throughput based on the number of clock cycles required for each packet. The results are presented in Table 6.1.
Table 6.1: Receive performance.
Packet size (bytes)    Cycles/packet    Throughput (Mbps)
64                     1863             64
512                    3906             243
1024                   6642             286
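The throughput figures follow from the cycle counts once a clock rate is fixed. The short Python sketch below assumes the 232 MHz IXP 1200 clock reported in the system benchmarks of this chapter; the function name `throughput_mbps` is hypothetical. Under that assumption the computed values round to the reported ones.

```python
CLOCK_HZ = 232e6  # assumption: the 232 MHz IXP 1200 clock quoted later in the chapter

def throughput_mbps(packet_bytes, cycles_per_packet, clock_hz=CLOCK_HZ):
    """Achievable throughput if each packet costs cycles_per_packet clock cycles."""
    packets_per_second = clock_hz / cycles_per_packet
    return packet_bytes * 8 * packets_per_second / 1e6

# (packet size in bytes, cycles per packet) taken from the receive measurements
for size, cycles in [(64, 1863), (512, 3906), (1024, 6642)]:
    print(size, round(throughput_mbps(size, cycles)))
```

The same relation reproduces the defragmentation results for a 512-byte packet when the per-fragment cycle count is multiplied by the number of fragments, for example 4 fragments at 842 cycles each gives about 282 Mbps.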
IP defragmentation The critical factor in IP defragmentation processing is the number of fragments per packet. We find that performance decreases as the number of fragments increases. The first phase of processing (Defrag-1) is a memory-bound process because the number of memory accesses required to insert fragments into the storage data structure is a function of the number of fragments, but the calculations performed on each accessed memory value are minimal. On the other hand, the second phase (Defrag-2) is a compute-bound process whose execution time is a function of the number of fragments, because it must perform several consistency checks on each fragment before building the defragmented packet. Tables 6.2 and 6.3 show the throughput of each phase for a 512-byte packet with varying numbers of fragments.
Table 6.2: Defrag-1 performance.
Number of fragments    Cycles/frag    Throughput (Mbps)
4                      842            282
8                      931            128
16                     1215           49
32                     1381           22
Table 6.3: Defrag-2 performance.
Number of fragments    Cycles/packet    Throughput (Mbps)
4                      3203             297
8                      4279             222
16                     14519            65
32                     25512            37
Rule-checking phase one Detection-1 searches through a list of header values to determine if a given packet matches any of the rule-header values. The list is structured so that there can be at most one match. Therefore, the worst case is when no match is found because the whole list must be traversed. This is a memory-bound process because only
simple comparison tests are performed on each accessed memory value. With a single thread running this process, we find that the throughput is low in the worst case (34 Mbps) since the microengine is idle most of the time waiting for SRAM memory operations to complete. Performance could be improved by using multiple threads with each processing a different packet. Another way to help performance here would be to store the list of values in faster memory. Since the list is relatively small and changed infrequently, an ideal location would be in microengine local memory. The IXP 1200 microengines do not have local memory, but the IXP 2x00 microengines do.
Rule-checking phase two The throughput of Detection-2 depends heavily on the time required to transfer a packet from the IXP to the FPGA over the PCI bus. As expected, the performance is better for large packets than for small packets. Once the data reaches the FPGA, the processing is completed very quickly. However, the PCI interface limits the overall performance of this module. As mentioned earlier, we hope to reduce this limitation by developing a higher-performance interface between the IXP and the FPGA. It is important to remember that our pipelined system is designed to filter uninteresting packets as soon as possible. Thus, for normal traffic, the rate of data reaching this final stage will be significantly less than the rate at the initial receive stage. Table 6.4 shows the worst-case performance, which is when all incoming packets reach the Detection-2 phase. The important metrics for the FPGA pattern-matcher are the number of pattern characters it can store and its throughput. We ran tests with different size rule sets loaded, including the full set of default rules in the Snort software package that contains 17,537 characters. Generally with FPGAs, an increase in logic resource usage causes increased interconnect delay and reduced maximum operating frequency. Table 6.5 shows the throughput supported by the FPGA circuit
for each rule set, but again, the actual throughput is limited by the PCI I/O connection.
Table 6.4: Detection-2 worst-case performance.
Packet size (bytes)    Throughput (Mbps)
64                     16
512                    34
1024                   51
Table 6.5: FPGA pattern-matching performance.
Number of characters    Resource usage    Freq (MHz)    Throughput (Mbps)
2001                    17%               119           951
4012                    25%               115           916
7996                    42%               101           809
17537                   80%               100           801
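Because the pattern matcher consumes one 8-bit character per clock cycle, its raw throughput is simply the clock frequency times eight bits. The sketch below checks that relation against the reported FPGA figures; the function name and the 4 Mbps tolerance are assumptions, the latter chosen because the reported frequencies appear to be rounded to whole megahertz.

```python
BITS_PER_CYCLE = 8  # one character of packet data per clock cycle

def fpga_throughput_mbps(freq_mhz):
    """Raw matcher throughput, ignoring the PCI transfer bottleneck."""
    return freq_mhz * BITS_PER_CYCLE

# (frequency in MHz, reported throughput in Mbps) for the four rule-set sizes
reported = [(119, 951), (115, 916), (101, 809), (100, 801)]
for freq, mbps in reported:
    print(freq, fpga_throughput_mbps(freq), mbps)
```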
6.3.3
System Benchmarks We ran some system-level benchmarks to determine how the components of the detection pipeline perform together. The testing environment was the same as that described in Section 6.3.1. We modeled our experiments after tests described in a report issued by the NSS Group [27], a testing lab for commercial IDS products. These tests are designed to measure the performance of the system under varying levels of load. We used a traffic generator to send different rates of fixed-size UDP packets to the NNIDS sensor. Because of limitations of the software and hardware in our packet-generating computer, we were not able to run tests at maximum rate with minimum-sized packets. Because there is a fixed processing overhead for each packet, tests using small packets generally yield lower performance since there are more packets being sent per second. Due to our design goal of ending the analysis of a packet as early as possible in the pipeline, the content of the packets has an effect on the performance. The most significant factor is the outcome of the Detection-1 stage. If a packet’s header matches the values of certain fields in one of the Snort rules, it must be further checked by the Detection-2 phase. Otherwise, no further processing is necessary. Due to the communication bottleneck in Detection-2, it can become the limiting component under high utilization. To determine the effects of packet size and Detection-1 matches, we ran two sets of tests: one with zero Detection-1 matches and one with 100 percent Detection-1 matches. The results of these tests are presented in Tables 6.6 and 6.7, respectively. For each rate and packet size, we measured the percentage of packets that the sensor was able to process and determined the maximum rate at which the sensor could operate without dropping any packets.
Table 6.6: Best case (0% Detection-1 matches).
Packet size    25 Mbps    50 Mbps    75 Mbps    100 Mbps    Max (Mbps)
64             100%       100%       100%       ∗           75
512            100%       100%       100%       100%        100
1024           100%       100%       100%       100%        100
∗ Our traffic generator could not send traffic at this rate for this size.
Table 6.7: Worst case (100% Detection-1 matches).
Packet size    25 Mbps    50 Mbps    75 Mbps    100 Mbps    Max (Mbps)
64             69%        40%        25%        ∗           15
512            100%       100%       100%       100%        100
1024           100%       100%       100%       100%        100
∗ Our traffic generator could not send traffic at this rate for this size.
These tests show that our NNIDS network interface card, running on a 232 MHz IXP 1200 and a 100 MHz Xilinx Virtex-1000 FPGA, was able to achieve performance approximately equal to that reported by the NSS Group in their test of the Snort 2.0.2 software running on a high-end server with dual 1.8 GHz Pentium 4 processors and 2 GB RAM [27].
6.4
CONCLUSIONS We have discussed the need for building high-speed NIDS that can reliably generate alerts as intrusions occur and have the intrinsic ability to scale as network infrastructure and attack sophistication evolve. We have analyzed the key design principles and have argued that network intrusion-detection functions should be carried out by distributed and collaborative NNIDS at the end hosts. We have shown that an NNIDS running on the network interface instead of the host operating system can provide increased protection, reduced vulnerability to circumvention, and much lower overhead.
We have also described our experience in implementing a prototype NNIDS, based on Snort, an Intel IXP 1200, and a Xilinx Virtex-1000 FPGA. We also developed, and will make available, several libraries that are essential for building IDS on the IXP. We have conducted benchmarking experiments to study the performance characteristics of the NNIDS components. These experiments help us identify the performance bottlenecks and give insights on how to improve our design. System stress tests showed that our embedded NNIDS can handle high-speed traffic without packet drops and achieve the same performance as the Snort software running on a dedicated high-end computer system. Our ongoing work includes optimizing the performance of our NNIDS, developing strategies for sustainable operation of the NNIDS under attacks through adaptation and active countermeasures, studying algorithms for distributed and collaborative intrusion detection, and further developing the analytical models for buffer and processor allocation. We are in the process of porting our design to the next generation of IXP processors and plan to utilize higher-performance and more tightly-integrated FPGA resources. We expect our system to reach multigigabit performance on the IXP 2400 and IXP 2800. We have tested FPGA pattern-matching designs that approach 10 Gbps throughput with the entire Snort ruleset using a Xilinx Virtex2 device. At rates beyond 10 Gbps, even with top-of-the-line FPGAs, it is not possible to fit all the Snort patterns into a single chip. However, we have developed designs capable of pattern-matching at up to 100 Gbps with a smaller ruleset, and multiple FPGAs can be used in parallel to increase pattern capacity [28]. In summary, we have provided a better understanding of the design principles and implementation techniques for building high-speed, reliable, and scalable network intrusion detection systems.
REFERENCES [1]
J. Allen, A. Christie, W. Fithen, J. McHugh, J. Pickel, and E. Stoner, “State of the Practice of Intrusion Detection Technologies,” CMU/SEI, Technical Report 99-TR-028, 2000.
[2]
R. Lippmann, D. Fried, I. Graf, J. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. Cunningham, and M. Zissman, “Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation,” Proceedings of DARPA Information Survivability Conference and Exposition, vol. 2, pp. 12–26, 2000.
[3]
R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das, “Analysis and results of the 1999 DARPA off-line intrusion detection evaluation,” Proceedings of Recent Advances in Intrusion Detection (RAID), pp. 162–182, 2000.
[4]
V. Paxson, “Bro: A system for detecting network intruders in real-time,” Computer Networks, 31(23–24), pp. 2435–2463, 1999.
[5]
T. H. Ptacek and T. N. Newsham, “Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection,” Secure Networks Inc., Technical Report, 1998.
[6]
L. G. Roberts, “Beyond Moore’s law: Internet growth trends,” IEEE Computer, pp. 117–119, January 2000.
[7]
“3Com Embedded Firewall Architecture for e-business,” 3Com Corporation, Technical Brief, 2002.
[8]
J. Balasubramaniyan, J. Garcia-Fernandez, D. Isacoff, E. Spafford, and D. Zamboni, “An architecture for intrusion detection using autonomous agents,” Proceedings of Computer Security Applications Conference, pp. 13–24, 1998.
[9]
S. M. Bellovin, “Distributed Firewalls,” login:, November 1999.
[10]
R. Gopalakrishna and E. H. Spafford, “A framework for distributed intrusion detection using interest-driven cooperating agents,” Proceedings of Recent Advances in Intrusion Detection (RAID), 2001, www.raid-symposium.org/Raid2001/papers.
[11]
C. Payne and T. Markham, “Architecture and applications for a distributed embedded firewall,” Proceedings of Computer Security Applications Conference, 2001. www.acsac.org/2001/papers/73.pdf.
[12]
P. A. Porras and P. G. Neumann, “EMERALD: Event monitoring enabling responses to anomalous live disturbances,” Proceedings of National Information Systems Security Conference, pp. 353–365, 1997.
[13]
G. Vigna, R. A. Kemmerer, and P. Blix, “Designing a web of highly-configurable intrusion detection sensors,” Proceedings of Recent Advances in Intrusion Detection (RAID), pp. 69–84, 2001.
[14]
“Gigabit Ethernet Intrusion Detection Solutions: Internet Security Systems RealSecure Network Sensors and Top Layer Networks AS3502 Gigabit AppSwitch Performance Test Results and Configuration Notes,” White Paper, 2000.
[15]
C. Kruegel, F. Valeur, G. Vigna, and R. A. Kemmerer, “Stateful intrusion detection for high speed networks,” Proceedings of IEEE Symposium on Security and Privacy, pp. 285–293, 2002.
[16]
M. Roesch, “Snort—Lightweight intrusion detection for networks,” Proceedings of USENIX LISA Conference, 1999.
[17]
“ENP-2505/2506 Data Sheet,” RadiSys Corporation, www.radisys.com/oem_products/ds-page.cfm?productdatasheetsid=1055.
[18]
“Look Aside Interface LA-1.0,” Network Processor Forum, www.npforum.org/techinfo/approved.shtml.
[19]
“Intel Network Processors,” Intel Corporation, www.intel.com/design/network/products/npfamily/.
[20]
“Virtex and Virtex-E Overview,” Xilinx, Inc, www.xilinx.com/xlnx/xil_prodcat_product.jsp?title=ss_vir.
[21]
S. McCanne, C. Leres, and V. Jacobson, “libpcap,” 1994. ftp.ee.lbl.gov.
[22]
K. Mackenzie, W. Shi, A. McDonald, and I. Ganev, “An Intel IXP1200-based network interface,” Proceedings of Workshop on Novel Uses of System Area Networks at HPCA (SAN-2), 2003. www.cs.arizona.edu/hpca9.
[23]
M. Fisk and G. Varghese, “Fast Content-based Packet Handling for Intrusion Detection,” UCSD, Technical Report CS2001-0670, 2001.
[24]
S. Staniford, C. J. Coit, and J. McAlerney, “Towards faster string matching for intrusion detection,” Proceedings of DARPA Information Survivability Conference, vol.1, pp. 367–373, 2001.
[25]
C. R. Clark and D. E. Schimmel, “Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns,” Proceedings of International Conference on Field Programmable Logic and Applications (FPL), pp. 956–959, 2003.
[26]
C. R. Clark and D. E. Schimmel, “A pattern-matching co-processor for network intrusion detection systems,” Proceedings of International Conference on Field-Programmable Technology (FPT), pp. 68–74, 2003.
[27]
“100 Mbps IDS Group Test, Edition 4,” The NSS Group, 2003, www.nss.co.uk.
[28]
C. R. Clark and D. E. Schimmel, “Scalable pattern matching on high-speed networks,” Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 249–257, 2004.
CHAPTER 7
Packet Processing on a SIMD Stream Processor
Jathin S. Rai, Yu-Kuen Lai, Gregory T. Byrd
Department of Electrical and Computer Engineering, North Carolina State University
The current generation of commercial network processors (NPs) [1] uses chip multiprocessing to take advantage of packet-level parallelism. Memory latency is tolerated with the help of hardware multithreading. To improve performance, many NPs make use of dedicated hardware units called coprocessors, which implement important tasks in hardware. The noncoprocessor approach is a more flexible solution: all the packet processing tasks are performed in software, with the help of specialized instruction sets to satisfy the performance requirement. Both of these architectures are extremely successful, but they do present some inefficiencies, previously discussed by Seshadri and Lipasti [2]:
✦ The coprocessor approach lacks flexibility and does not scale well to different applications and protocols, because certain functions are fixed in hardware.
✦ In both approaches, unsynchronized and arbitrary memory references by different processing elements fail to extract the maximum memory bandwidth.
✦ The processing elements normally access shared data structures, which requires extra synchronization logic to provide for mutual exclusion. This complicates the programming model and can result in a performance bottleneck.
To address these issues, this chapter explores the use of a SIMD stream processing architecture for packet processing. Stream processors are designed to support the stream programming model [3]. A stream program applies a consistent set of operations (compound operations) to each of a sequence (stream) of elements. For packet processing, each stream element could be a packet, or the contents of a packet could be viewed as a stream of data elements. A set of complex operations
is performed on the stream elements, with all temporary data being generated and consumed locally, thereby reducing the number of trips to the memory. Stream programs exploit the SIMD (Single Instruction stream, Multiple Data stream) mode of parallelism, since the same complex operation is performed on every stream element. SIMD execution mode simplifies access to shared data, because explicit synchronization is not required. Memory operations can be coordinated and optimized for high bandwidth, using vector primitives like indexed memory and scatter-gather. Unlike traditional SIMD applications, however, network applications exhibit load imbalance and control flow variations, due to variable packet sizes and nonuniform processing requirements. Conditional streams [4] provide one approach for coping with these issues. After a brief introduction to the stream programming model and the Imagine architecture (Section 7.1), we present the implementation and performance of two applications: AES packet encryption (Section 7.2) and IPv4 packet forwarding (Section 7.3). These applications represent the two extremes of computation-intensive (encryption) and memory-intensive (forwarding) applications. With some minor architectural changes, we find that a 500-MHz Imagine processor can deliver 1.6 Gb/s throughput for encryption and 2.9 Gb/s for forwarding on real packet traces.
7.1
BACKGROUND: STREAM PROGRAMS AND ARCHITECTURES For this performance and feasibility study, we have chosen the Imagine stream processor [5], designed by Kapasi et al. for media processing applications. Imagine provides eight multi-ALU compute clusters, operating in SIMD mode, and a stream-oriented memory hierarchy.
7.1.1
Stream Programming Model The stream programming model [3] expresses data as streams, which are sent to the kernels responsible for the processing of this data. The kernels consume a set of streams from the stream program, perform compound operations on the stream elements, and produce a set of output streams. The stream program is responsible for sending the input streams to the kernels, invoking the kernels, and storing the output streams generated by the kernels. Figure 7.1 illustrates the data flow for a sample stream application.
[Figure 7.1: Data flow of a sample stream application. Blocks: Input stream 1, Input stream 2, KERNEL A (two instances), KERNEL B, Output stream.]
The stream programming model exposes the inherent parallelism in applications like packet processing so that it can be exploited by the stream architecture. This helps in maintaining high performance densities, while also providing programmability. This study relies on a set of stream programming tools developed at Stanford University for the Imagine processor [5]. These tools include the StreamC and KernelC programming languages and a stream scheduler for Imagine.
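The division of labor between a stream program and its kernels can be sketched in plain Python. This is not StreamC or KernelC, whose syntax is not shown in this chapter; the kernel bodies and the way the two kernels are wired together are hypothetical and chosen only to show the pattern of streams flowing through kernels.

```python
def kernel_a(stream):
    """Kernel (hypothetical): apply the same compound operation to every element.
    The intermediate value stays local to the kernel, as temporaries do in the LRF."""
    out = []
    for x in stream:
        squared = x * x                  # local temporary, never written back
        out.append((squared + 1) & 0xFFFFFFFF)
    return out

def kernel_b(stream1, stream2):
    """Kernel (hypothetical): consume two streams and produce one output stream."""
    return [a ^ b for a, b in zip(stream1, stream2)]

def stream_program(input1, input2):
    """Stream program: routes streams to kernels and collects the output stream."""
    mid1 = kernel_a(input1)              # intermediate streams passed between kernels
    mid2 = kernel_a(input2)
    return kernel_b(mid1, mid2)
```

Because each kernel applies one operation sequence to every element, a SIMD machine can run the loop body on several elements at once without explicit synchronization between them.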
7.1.2
Imagine Stream Architecture Imagine [5] is a processor designed to support the stream programming model for media processing applications. The block diagram of the Imagine stream processor is shown in Figure 7.2. It consists of eight VLIW clusters operating in a SIMD fashion, controlled by a microcontroller. Each cluster contains three adders, two multipliers, one divider, a 256-word scratchpad for temporary storage, and an intercluster communication unit. An attached host processor executes the high-level stream program, which deals with the creation and coordination of streams and kernels. Kernels are executed on the arithmetic clusters. The arithmetic clusters are supported by a three-level memory hierarchy. The highest level in the hierarchy is the local memory provided for each cluster, known as the Local Register File (LRF). There is a centralized register file known as the Stream Register File (SRF), which is used to stage and store data (streams) to/from the clusters and the memory. The lowest level consists of a streaming memory system that controls four DRAM chips. The stream programming model encourages the use of the LRF and SRF. A kernel produces and consumes local temporary values in the LRF; these values are never stored in main memory. Longer-lived values, passed between kernels, are expressed as streams, which are stored in the SRF. The main memory is needed only for the initial and final stream values, or for streams that are too large to fit in the SRF.
[Figure 7.2: Architecture of the Imagine stream processor [5]. Blocks: host processor, host interface, stream controller, network interface (to other Imagine nodes and I/O), microcontroller, ALU clusters 0 to 7, stream register file, streaming memory system, SDRAM.]
For the purposes of this study, we assume that packets are placed in main memory before being processed by the stream processor. The IPv4 forwarding application processes a stream of packet headers. For the AES encryption application, each packet is viewed as a stream of data blocks. The Imagine processor is designed to operate at a clock frequency of 500 MHz. In the sections below, we describe some architectural modifications to Imagine to better support our packet processing applications. Even with the modifications, we assume that the processor continues to operate at 500 MHz. The performance studies for this chapter are based on a cycle-accurate simulator provided by Stanford.
7.2
AES ENCRYPTION Networking applications tend to be either memory-intensive or computation-intensive. We chose a representative of each class to investigate the usefulness
of the stream architecture for general packet processing. The computation-intensive application is packet encryption, using the AES (Advanced Encryption Standard) [6] symmetric-key cipher. AES has been implemented on a wide range of platforms and has proved to be both versatile and fast. The Wireless LAN 802.11i working group is adopting AES to replace the vulnerable WEP algorithm. Moreover, the iSCSI protocols for Storage Area Networks (SANs) rely on IPSec, in which AES plays a key role in protecting data. For this benchmark, a sequence of packets is provided, along with a sequence of keys, and the payload of each packet is encrypted with the corresponding key.

AES operates on 128-bit blocks of data, while the key can be any one of three sizes: 128, 192, or 256 bits. For this work, we choose a 128-bit key. For this combination of block size and key size, the cipher algorithm performs ten rounds of cryptographic operations in the main loop. There are four major functions within the round loop: SubBytes(), ShiftRows(), MixColumns(), and XorRoundKey(). Following the main loop is the final round, where only SubBytes(), ShiftRows(), and XorRoundKey() are applied.

There is a very efficient way of implementing the cipher on a 32-bit processor by using a lookup table, known as the T-table [7]. The T-table is the result of one complex transformation on SubBytes(), ShiftRows(), MixColumns(), and XorRoundKey(). Hence, the main loop (without the final round) of encryption can be done in a table-lookup fashion. Each T-table (Ti) is a rotated version of the previous T-table (Ti−1). Therefore, at the expense of an extra rotation operation, storing only one T-table is enough. For the final round, the S-box (the substitution table used in the SubBytes() function) has to be used instead of the T-table, due to the absence of the MixColumns() operation [6]. To save space, the S-box is not stored explicitly in our work. Instead, it is derived by an extra mask operation on the T-table.
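As a minimal sketch of the table trick described above, the following Python fragment builds one T-table from the AES S-box, obtains the other three tables by byte rotation, and recovers S-box values with a shift and a mask. The helper names (gf_mul, make_sbox, t_table, sbox_from_t0) and the byte order chosen for the table words are assumptions made for illustration; this is not the kernel code used in this work. The table built here folds SubBytes() and MixColumns() into each entry, with ShiftRows() and XorRoundKey() left to the indexing and XOR steps around the lookup.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def make_sbox():
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = 0x63
        for r in range(5):                # inv ^ rotl(inv,1..4) ^ 0x63
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = make_sbox()

# T0[x] packs the column (2*s, s, s, 3*s) for s = SBOX[x], most significant
# byte first (assumed layout).
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]


def rotr32(word, bits):
    """Rotate a 32-bit word right by the given number of bits."""
    bits %= 32
    return ((word >> bits) | (word << (32 - bits))) & 0xFFFFFFFF


def t_table(i, x):
    """Entry x of table Ti, derived from the single stored table T0."""
    return rotr32(T0[x], 8 * i)


def sbox_from_t0(x):
    """Recover SBOX[x] from T0: the unscaled S-box byte sits in bits 8..15."""
    return (T0[x] >> 8) & 0xFF
```

Storing only T0 costs 256 words, which is why the rotation and mask are an attractive trade for scratchpad space.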
7.2.1
Design Methodology and Implementation Details

The stream-level flow diagram is shown in Figure 7.3. Both the input key stream and the data stream consist of a collection of records. Each record is the building block of its stream and is defined as a data type consisting of four 32-bit words (a total of 128 bits). The input key stream and the data stream must each contain a number of records that is a multiple of the number of clusters. In other words, the minimum number of records in the input key stream is eight for a system with eight clusters. Given a key stream with eight records, the subkey stream will contain 88 records (eleven 128-bit round keys for each of the eight keys) in an interleaved form after the key expansion process.
FIGURE 7.3: Stream-level diagram of the encryption application. Elements shown: input key stream, KEY_EXPANSION kernel, subkey stream, data stream, CORE kernel, FINAL_ROUND kernel, output stream.
The core and final_round kernels

The AES encryption operation contains two major kernels. The core kernel consists of the intercluster communication for subkeys, the T-table lookups, and the arithmetic operations needed to encrypt a block. The core kernel takes the subkey stream and stores eight sets of the subkeys in the scratchpad of each cluster. Extra intercluster communication is needed to transfer the subkeys if each cluster is encrypting its data block with the same set of subkeys. The final_round kernel applies an extra rotate and mask instruction to the T-table to derive the S-box value for the byte substitution transformation. Following the ShiftRows() and XorRoundKey() operations, the encrypted data is sent out as a data stream.

Originally, on the Imagine processor, each cluster contains a single 256-word scratchpad register file, which supports coefficient storage, short arrays, small lookup tables, and some local register spilling [8]. For our simulations, the size of the scratchpad is increased to 512 words in order to accommodate the T-table and the other array variables used in the kernels.

The core kernel consumes 72.5 percent of the total encryption cycles. Of the 252 scratchpad accesses in the core kernel, 216 are reads. The scratchpad has one output and three input units, which allows simultaneous read and write access [9]. However, the ratio of read to write accesses in the core kernel is 6 to 1, since only read access is needed in the main round operation. Among the 216 scratchpad read operations, 180 are located in Basic Block 4 of the core kernel, where the T-table lookup is performed. The critical path can be reduced by up to 15 percent by adding a second scratchpad to allow concurrent reads of the T-table. Therefore, a second scratchpad is implemented and added to the machine description file to hold the second T-table, so that two simultaneous read accesses can be served. All of the performance results presented in this section are based on the configuration with two 512-word scratchpads.
The key_expansion kernel

The key_expansion kernel is based on the AES Key Schedule algorithm [7]. Because the algorithm is inherently sequential, the kernel is implemented so that each cluster processes one key. Therefore, with eight clusters, the processor can generate up to eight different sets of subkeys at the same time. The kernel consists of two basic blocks. The first basic block saves the incoming key stream into the scratchpad, and the main key expansion loop is in the second block. The effective parallelism achieved in Basic Block 1 is only 2.79, with a total runtime of 379 cycles. The effective parallelism is defined as the “ratio of the total number of instructions per block to the number of cycles in the critical path” [10]. As the effective parallelism indicates, the kernel does not fully utilize the available ALU resources. Given the same hardware configuration with eight clusters in the Imagine, the ILP can be increased simply by processing two different keys at the same time in each cluster. A dual version of the key_expansion kernel is implemented, in which up to 16 different sets of subkeys are calculated at the same time. The scheduling result shows that the processing capability is doubled with a 24.2 percent increase in kernel runtime (four-adder configuration), while achieving an effective parallelism of 4.5.
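To make the sequential dependence in the key schedule concrete, here is a minimal Python sketch of AES-128 key expansion, followed by a helper that arranges the round keys of several keys into one stream of four-word records. The function names are hypothetical, and the interleaving order (all keys for round 0, then all keys for round 1, and so on) is an assumption; the text says only that the subkey stream is interleaved.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def make_sbox():
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        s = 0x63
        for r in range(5):
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = make_sbox()


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return ((SBOX[(w >> 24) & 0xFF] << 24) | (SBOX[(w >> 16) & 0xFF] << 16)
            | (SBOX[(w >> 8) & 0xFF] << 8) | SBOX[w & 0xFF])


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key_128(key):
    """AES-128 key schedule: 16 key bytes in, 44 round-key words out."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = w[i - 1]                   # each word depends on the previous one
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        w.append(w[i - 4] ^ temp)
    return w


def subkey_stream(keys):
    """One 4-word record per (round, key), interleaved by round (assumed order)."""
    expanded = [expand_key_128(k) for k in keys]
    return [tuple(e[4 * r:4 * r + 4]) for r in range(11) for e in expanded]
```

Because word i needs word i-1, a single key offers little parallelism, which is consistent with assigning one key (or two, in the dual version) to each cluster rather than splitting one key across clusters.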
7.2.2
Experiments

The cycle counts for encryption are measured by subtracting the time for loading the microcode and for key expansion from the total cycles. Three machine configurations are used in the simulations, denoted add3, add4, and add6. Add3 is the original Imagine machine description file, which has three adders in each cluster. The add4 and add6 configurations increase the number of adders to four and six, respectively. All three configurations have two 512-word scratchpads.
Varying the stream size

For this application, the sequence of data blocks in the packet payload is treated as a stream. For the first set of experiments, a number of equal-sized packets, ranging from 8 to 96 blocks each, are sent into the kernel. The total amount of data is 61,440 blocks (960 Kbytes), which is 7.5 times larger than the stream register file (SRF), meaning that packet data must be transferred from DRAM
FIGURE 7.4: AES performance with varying stream sizes. Cycles per block (0 to 80) versus size of the packet stream in 16-byte blocks (8, 16, 32, 64, 96, Var*) for the Add3, Add4, and Add6 configurations.
to the SRF during the encryption. For example, if 16 blocks is chosen as the packet size, the total number of packets processed is 3840. The simulation results are shown in Figure 7.4, where the size of 96 blocks has the best performance. The throughput is 2.02 Gb/s with a system clock of 500 MHz. The purpose of this setup is to have a full-duplex stream flow between the SRF and the main memory, so that the effectiveness of overlapping memory latency with kernel computation can be observed. Figure 7.5 shows the ratio of the kernel runtime to the total runtime. The total runtime consists of the stream operations, stalls, and kernel runtime. For a packet size of eight blocks, the kernel takes only 60 percent of the total runtime. However, as the packet size increases, the kernel runtime can take up to 98 percent of the total runtime. The performance with a small data stream suffers from the short stream effect [11], as seen in the eight-block performance in Figure 7.4. This is due to a fixed cost that must be paid (variable initialization, constant setup, etc.) before and after the main loop inside a kernel; if the stream is short, the fixed cost cannot be amortized across the runtime. Another fixed cost, for intercluster communication, is imposed on the core kernel, where the code is modified to transfer the subkeys among the clusters for key agility. Variable-sized packets are also simulated; the results are shown in Figures 7.4 and 7.5, denoted as Var. The trace (AIX-1054837521-1) [12] used for this
FIGURE 7.5: AES kernel execution as a percentage of runtime. Percentage of total runtime (0 to 100) spent in stream operations, stalls, and kernel execution versus size of the packet stream in 16-byte blocks (8, 16, 32, 64, 96, Var*).
simulation was collected from the NASA Ames Internet exchange (AIX) [13], on one of the OC-3 ATM links that interconnect AIX and MAE-West in San Jose. Almost 50 percent of the packets are smaller than 128 bytes, accounting for less than 6 percent of the total bandwidth. On the other hand, almost 12 percent of the packets are 1500 bytes, consuming more than 75 percent of the total bandwidth. The average behavior is therefore close to the best-case, large-stream performance.
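The data-volume figures quoted for the stream-size experiment can be cross-checked with a few lines of arithmetic, and the short stream effect can be pictured with a simple amortization model. In the sketch below, the constants come from the text, while the cost model in cycles_per_block and any parameter values passed to it are hypothetical illustrations rather than measured quantities.

```python
BLOCK_BYTES = 16          # one AES block
TOTAL_BLOCKS = 61_440     # total data volume quoted in the text


def total_kbytes():
    """Total payload in Kbytes for the whole experiment."""
    return TOTAL_BLOCKS * BLOCK_BYTES / 1024


def packet_count(blocks_per_packet):
    """Number of equal-sized packets needed to carry all the blocks."""
    return TOTAL_BLOCKS // blocks_per_packet


def implied_srf_kbytes(ratio=7.5):
    """SRF capacity implied by 'the data is 7.5 times larger than the SRF'."""
    return total_kbytes() / ratio


def cycles_per_block(blocks_per_packet, fixed_cycles, per_block_cycles):
    """Hypothetical model: a fixed per-stream cost amortized over the blocks."""
    return per_block_cycles + fixed_cycles / blocks_per_packet
```

Under this model the fixed term shrinks as 1/n, which matches the qualitative shape described for Figure 7.4: a steep penalty at eight blocks that flattens out for long streams.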
Key agility

For a security gateway router, where the encryption service is provided for multiple user sessions, there exists a worst-case scenario in which every incoming packet has to be encrypted with a new key. Therefore, the ability of a system to handle key changes efficiently, without degrading performance, is a critical performance factor. One commonly used scheme [14] is to compute the round key expansion on the fly in pipelined fashion. Another scheme is to precompute the round keys in advance, as soon as the security parameters for a
flow are established, before the actual messages arrive. However, the drawback is that the memory storing these expanded round keys has to grow in proportion to the expansion ratio. A third way is to expand only the sets of subkeys that are going to be used soon. Based on the assumption of a store-and-forward architecture, in which the incoming packet is stored in the data memory, it is possible for a host processor to look ahead into the control memory to identify the next eight packets that are going to be processed. The host processor can then supply the key stream, which contains eight different keys, to the key_expansion kernel before the packet encryption begins. After the keys are expanded, all the subkeys are stored inside the scratchpads of clusters 0 to 7, where cluster 0 has the first set of subkeys, cluster 1 has the second set, and so forth. Using the intercluster communication network, each set of subkeys can be broadcast to all the clusters; therefore, all the clusters can process the blocks of the same packet with the same subkeys. After the eighth packet has been processed, the key_expansion kernel is executed again to calculate the next eight sets of subkeys for the packets to be processed.

As in the experiment setup of the previous section, the key_expansion kernel is executed once every eight packets, since eight different keys can be expanded at the same time. Therefore, for a packet size of eight blocks, 7680 128-byte packet streams are sent into the clusters, and 960 key streams (each containing eight 128-bit keys) are consumed by the key_expansion kernel. Performance with key agility is shown in Table 7.1. The worst-case scenario is the packet size of eight blocks, since the key_expansion kernel has to be executed more frequently. The runtime of the key_expansion kernel is 381 cycles; therefore, for eight-block packets, the overhead relative to encrypting with a single key is an extra six cycles per block on average. The core kernel consumes a fixed amount of time to transfer a set of subkeys (44 words) from the scratchpad in the cluster. As the
TABLE 7.1: Key-agility performance (four adders, code optimized).

Size of packet stream (16-byte blocks)    Cycles per block
8                                         85.98
16                                        44.97
32                                        37.57
64                                        34.45
96                                        33.41
packet size gets smaller, this overhead becomes more pronounced. It is in addition to the short stream effect discussed earlier. The best efficiency can be achieved only when all eight clusters are processing 8 or 16 different keys. On the other hand, if only one new key is needed while the other 7 or 15 keys remain the same, the efficiency is at its worst, because the same calculation is repeated. Another way to improve performance and efficiency is to add an extra layer of memory between the SRF and the clusters to serve as a subkey cache. However, this might need a large cache to achieve a satisfactory hit rate. This will be explored in future work.
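The key-expansion overhead quoted in this subsection follows from spreading one kernel invocation over the blocks of the eight packets it serves. The sketch below reproduces that arithmetic; the constants are taken from the text, and the function names are hypothetical.

```python
KEY_EXPANSION_CYCLES = 381    # runtime of one key_expansion kernel call
KEYS_PER_EXPANSION = 8        # one key per cluster, eight clusters
TOTAL_BLOCKS = 61_440         # total data volume of the experiment


def key_overhead_per_block(blocks_per_packet):
    """Key-expansion cycles charged to each block when every packet has its own key."""
    return KEY_EXPANSION_CYCLES / (KEYS_PER_EXPANSION * blocks_per_packet)


def stream_counts(blocks_per_packet):
    """Return (packet streams, key streams) for a given packet size."""
    packets = TOTAL_BLOCKS // blocks_per_packet
    return packets, packets // KEYS_PER_EXPANSION
```

For eight-block packets this gives just under six cycles per block, and the same formula shows the charge falling to about half a cycle per block for 96-block packets.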
Varying the number of clusters

In this study, the data blocks within a packet are distributed and processed among the clusters. This is a simple way to preserve the arriving packet sequence without an extra reordering mechanism. However, based on the packet length distribution from a real Internet trace, doubling the number of clusters from 8 to 16 results in only a limited additional speedup, and the efficiency is below 80 percent [15]. One way to improve both performance and efficiency is to concatenate more packets from the same flow to form a larger data stream for encryption. We leave this as an area for future work.
Cluster statistics for AES encryption

Table 7.2 shows the occupancy of the functional units while encrypting a single block (16 bytes) of data for the three-adder and four-adder configurations. As more adders are provided, the total execution time decreases. Therefore, the utilization for the scratchpad, multiplier, divider, and communication unit increases. On the other hand, the occupancy of the adder units decreases, simply because instructions are distributed to the extra adder. The multiplier units are used only for select and shuffle instructions. The divider unit and the two multiplier units can be replaced with adders, which also provide the same operations, so that area can be saved with minimal performance degradation. The scratchpad utilization is not symmetric, mainly because of array variables that are accessed in addition to the main T-table lookup.

TABLE 7.2: Functional unit occupancy (percent) for AES encryption.

                 Add1  Add2  Add3  Add4  Mul1  Mul2  Div  SP1   SP2   Comm
3-adder config.  58.0  55.2  55.2  –     19.4  19.4  7.6  34.9  26.3  19.1
4-adder config.  44.3  48.8  46.4  47.7  21.2  21.2  8.4  39.8  28.1  18.6
TABLE 7.3: Performance of AES-OCB encryption (4-adder configuration).

Size of packet stream    Cycles per block
8                        142.12
16                       78.43
32                       58.90
64                       48.42
96                       46.63
Mode of operation

We have so far considered only the Electronic Codebook (ECB) mode of encryption. This allows each data block to be processed independently, in parallel. However, a particular plaintext block is always encrypted to the same ciphertext block under a given key. Therefore, a codebook can be built up, and privacy is compromised once the relation between ciphertext and plaintext is known. More sophisticated modes offer protection from repeated plaintext-ciphertext pairs, and some still allow packets to be processed in parallel. The Counter Mode (CTR) is one such mode that was recently added to NIST’s approved list [16]. As proposed in response to NIST’s recent call for modes of operation [17], the Offset Codebook mode (OCB) [18] and the Carter-Wegman + Counter dual-use mode (CWC) [19] can also be operated in parallel. Another publication [15] discusses the details of our implementation of OCB mode. The performance for the four-adder configuration (with two 512-word scratchpads) is shown in Table 7.3. Performance is given by the total runtime (including the time for generating the tag) divided by the size of the packet stream. Since the time for tag generation is fixed, the performance is degraded for smaller packet sizes.
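The difference between ECB and a counter-based mode can be seen in a few lines of Python. This is only a sketch: stand_in_cipher is a hypothetical keyed function built from HMAC-SHA256 and used here in place of AES so that the example stays self-contained. It is deterministic like a block cipher but not invertible, so it can illustrate the ECB leak and the CTR keystream but is not a usable ECB cipher.

```python
import hashlib
import hmac

BLOCK = 16  # bytes per cipher block


def stand_in_cipher(key, block):
    """Deterministic 16-byte keyed function used here in place of AES."""
    return hmac.new(key, block, hashlib.sha256).digest()[:BLOCK]


def ecb_encrypt(key, blocks):
    """ECB: every block is transformed independently by the same function."""
    return [stand_in_cipher(key, b) for b in blocks]


def ctr_crypt(key, nonce, blocks):
    """CTR: XOR each block with the keyed function of nonce || counter.

    The same call encrypts and decrypts, and each block can be handled
    independently of the others, so the work still spreads across clusters.
    """
    out = []
    for counter, block in enumerate(blocks):
        pad = stand_in_cipher(key, nonce + counter.to_bytes(8, "big"))
        out.append(bytes(p ^ k for p, k in zip(block, pad)))
    return out
```

With two identical plaintext blocks, ecb_encrypt returns two identical outputs, whereas ctr_crypt returns different ones because the counter differs for each block position.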
7.2.3
AES Performance Summary

Because we choose to interpret a packet as a stream of blocks to be encrypted, performance of the AES encryption algorithm is very dependent on packet size. Large packets amortize stream overhead over a number of blocks, while short packets suffer from the short stream effect. For ECB mode, throughput ranges from 2.02 Gb/s (96-block packets) to 0.8 Gb/s (8-block packets). For a realistic packet trace, large packets tend to dominate performance, resulting in 1.6 Gb/s. The best published performance for a 32-bit uniprocessor is 232 cycles per 16-byte block in ECB mode [20, 21]. Our best ECB performance, using eight
arithmetic clusters on a 96-block stream, is 32 cycles per block. This is not a strictly level comparison, because the uniprocessor measurement assumes that all data resides in the L2 cache [20], while our measurements include the cost of moving data between the SRF and main memory. On the other hand, our block-parallel approach is not appropriate for some feedback-based encryption modes, such as CBC.

Cryptography algorithms are commonly implemented as dedicated hardware accelerators for network processors. Such hardware can be integrated within the same die, as in Intel’s IXP2850, which contains two crypto-engines supporting IPSec at 10 Gb/s. Hardware support can also take the form of security coprocessors supporting a wide range of security functions, as in Broadcom’s BCM5841 and Hifn’s HIPP III 80xx, both capable of supporting IPSec at multigigabit-per-second speeds. We do not expect to outperform custom hardware, and AES should be sufficiently long-lived to justify a hardware approach [22]. However, a software-only solution may be justified in an environment with varying security requirements and/or extreme cost constraints.
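The throughput and cycle-count figures in this summary are two views of the same measurement, related through the 128-bit block size and the 500-MHz clock. The short sketch below makes the conversion explicit; the function names are hypothetical, and the cycle ratio against the uniprocessor compares cycle counts only, without adjusting for clock rate or memory assumptions.

```python
CLOCK_HZ = 500e6      # system clock used in the simulations
BLOCK_BITS = 128      # AES block size


def throughput_gbps(cycles_per_block):
    """Throughput in Gb/s for a given cost in cycles per 16-byte block."""
    return BLOCK_BITS * CLOCK_HZ / cycles_per_block / 1e9


def cycles_per_block(gbps):
    """Cycles per 16-byte block implied by a throughput in Gb/s."""
    return BLOCK_BITS * CLOCK_HZ / (gbps * 1e9)


def cycle_ratio(reference_cycles, our_cycles):
    """Ratio of two per-block cycle counts."""
    return reference_cycles / our_cycles
```

For example, 32 cycles per block corresponds to 2.0 Gb/s, while the quoted 0.8 Gb/s and 1.6 Gb/s correspond to 80 and 40 cycles per block, respectively.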
7.3
IPV4 FORWARDING

Routing and forwarding applications tend to be memory-intensive, involving a series of pattern matches and table lookups. Since IPv4 is the most widely used protocol for Layer 3 routing and requires the most processing time (compared to other switching schemes like MPLS), we use an IPv4 forwarding algorithm proposed by Mehrotra [23] as a case study. This algorithm involves a series of lookups in a small, compact table, followed by a single lookup in a larger table to determine the output port. The benchmark includes only the route lookup portion of IPv4 forwarding.

The IP forwarding algorithm proposed by Mehrotra [23] employs a trie-based scheme, in which the routing table information is compacted enough to be stored in the on-chip SRAM of a modern processor (around 512 KB to 1 MB). The table containing the next-hop values is stored in DRAM (the DRAM table). For this study, we use a 16-degree trie (i.e., each node has 16 children) with eight levels. The SRAM table is built by storing a 1 or 0 for every node of the trie, depending on whether or not it has child nodes. Each of the child nodes is in turn represented by a 1 or 0 in the SRAM table on the same basis. Thus, only the leaf nodes of the trie are represented by 0s in the SRAM table. The route lookup is done in two stages. The first stage involves only SRAM lookups, using four bits of the address to index into each level of the SRAM table.
Every level of the SRAM is traversed until the longest path corresponding to the address is determined from the bit-pattern stored (i.e., until it reaches a 0). The information from the SRAM yields the row and column address of the DRAM table, where the corresponding next-hop address is stored. A single DRAM access is then made to obtain the next-hop address. The algorithms for searching and generation of the trie are not discussed in detail here.
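The two-stage lookup can be sketched with a small bitmap trie. The layout below is an assumption made for illustration and is not taken from Mehrotra's paper: each level is a list of 16-bit bitmaps (one per node, bit c set when child c has children), a child's index in the next level is the sum-of-1s over all bits that precede it in its level, and the DRAM row and column are taken to be the level and the rank of the leaf among the 0 entries of that level. The names ones_before, lookup, LEVELS, and DRAM, and the port labels, are hypothetical.

```python
def ones_before(level_nodes, node, child):
    """Sum-of-1s over all bitmap bits that precede (node, child) in one level."""
    total = sum(bin(bitmap).count("1") for bitmap in level_nodes[:node])
    return total + bin(level_nodes[node] & ((1 << child) - 1)).count("1")


def lookup(levels, dram, address):
    """Walk the bitmap levels four address bits at a time, then read DRAM once."""
    node = 0
    for depth, level_nodes in enumerate(levels):
        child = (address >> (28 - 4 * depth)) & 0xF
        rank = ones_before(level_nodes, node, child)
        if not (level_nodes[node] >> child) & 1:
            # A 0 entry marks a leaf: stop and form the (row, column) address.
            column = node * 16 + child - rank
            return dram[(depth, column)]
        node = rank          # a 1 entry: descend to that child in the next level
    raise ValueError("trie is deeper than the levels provided")


# Tiny hand-built example: only child 0xC of the root has children.
LEVELS = [[1 << 0xC], [0x0000]]
DRAM = {(0, 1): "port A", (0, 12): "port B", (1, 0): "port C"}
```

In this example, address 0x10000000 stops at the first level, while 0xC0A80001 follows the 1 bit for nibble 0xC and stops at the second level; either way only one DRAM read is made, after the SRAM walk has finished.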
7.3.1
Design Methodology and Implementation Details

The role of the IP Forwarding module [24] is to find the next-hop address for every packet arriving at the router, based on a table of next-hop entries stored in the DRAM. The unit of data (i.e., a record) consumed by the forwarding engine is the extracted 32 bits of the packet header plus some control information populated by the bit extraction engine. The design of the bit extraction engine is described elsewhere [24]. The data flow of the forwarding engine is shown in Figure 7.6. The initialization kernel prepares the SRAM table, which is persistent across the iterations of the other kernels. (To store the SRAM table, the memory hierarchy of Imagine is altered by adding an on-chip memory (SRAM) between the local memory of the clusters and the SRF.) The initialization kernel is run only when a change in the routing table results in a new trie-based SRAM table. Incoming packets are organized as streams and sent to the hardware bit-extraction module, which is programmed to extract 32 bits from the header, depending on the type of routing employed. This packet data is sent to the forwarding kernel. The forwarding kernel performs the SRAM table lookups, visiting only as many levels of the trie as needed, then outputs a stream of row and column addresses of the DRAM table, one pair for each packet in
FIGURE 7.6: Data flow of the forwarding engine. Elements shown: input stream of packets, initialization kernel, persistent SRAM table information, packet data, forwarding kernel (shown twice), DRAM access, and output next-hop address.
the stream. This stream is sent to the DRAM as an indexed stream operation to obtain the next hop addresses of all the packets. The indexed stream access is used to maximize the bandwidth offered by the DRAM, thus reducing the DRAM access bottleneck faced by other forwarding engines. In an effort to minimize the performance degradation as a result of the sequential memory access, the forwarding operation is software pipelined [25], such that the memory access is overlapped with the computation of the kernels.
Forwarding kernel

The forwarding kernel performs the essential task of obtaining the row and column addresses of the DRAM table for every packet. The kernel takes a stream of extracted packet data and outputs streams of processed and unprocessed packets. It calculates the row and column address for each packet based on the information stored in the SRAM table. This calculation primarily involves a sum-of-1s operation, performed for every packet at every level of the table, which is computationally demanding. Mehrotra [23] satisfies the computational requirement of the sum-of-1s calculation by providing a dedicated cascade of adders. We instead use software optimizations to enhance performance. First, two SRAM table nodes, each 16 bits, are packed into a single 32-bit location in the SRAM. This maximizes memory utilization and provides a faster implementation, since it reduces the number of trips to the SRAM. Second, the sum-of-1s is calculated with the help of scratchpad reads, eight bits at a time, from an array containing the precomputed sum-of-1s for every possible combination of eight bits. Four scratchpad reads are needed to obtain the sum-of-1s for one SRAM location (i.e., two SRAM table nodes), which improves performance dramatically compared to a software implementation using a shifter and an adder.

On completion of the sum-of-1s calculation, the processed packets are pushed to the output stream, while the unprocessed packets are processed again in the next iteration by the clusters. The processing time of each packet depends on the number of levels it traverses. Thus, hardware utilization suffers, because some clusters sit idle while others are still processing packets. The stream programming model provides a solution to this problem in the form of conditional streams [4]: idle clusters can be replenished with packets once they are done processing. At the end of every iteration, processed packets are pushed to the output stream, and new packets are sent to the clusters that have finished, thereby keeping all the clusters busy at any given point in time. There are a few unprocessed packets at the end of the loop when all the packets
from the input stream are exhausted. There will be fewer unprocessed packets than clusters, so processing them separately would waste resources, since the cluster utilization would be very low. It makes more sense to leave them unprocessed and complete them with the next set of packets.
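The two software optimizations described for the forwarding kernel, packing two 16-bit nodes per word and counting 1s with an 8-bit lookup table, can be sketched as follows. The names ONES_IN_BYTE, pack_nodes, sum_of_ones, and sum_of_ones_shift_add are hypothetical, and the choice of which node occupies the upper half of the word is an assumption.

```python
# Precomputed sum-of-1s for every 8-bit value (the scratchpad-resident array).
ONES_IN_BYTE = [bin(value).count("1") for value in range(256)]


def pack_nodes(first, second):
    """Pack two 16-bit SRAM table nodes into one 32-bit word (assumed order)."""
    return ((first & 0xFFFF) << 16) | (second & 0xFFFF)


def sum_of_ones(word):
    """Sum-of-1s of a 32-bit word using four table reads, eight bits at a time."""
    return (ONES_IN_BYTE[word & 0xFF]
            + ONES_IN_BYTE[(word >> 8) & 0xFF]
            + ONES_IN_BYTE[(word >> 16) & 0xFF]
            + ONES_IN_BYTE[(word >> 24) & 0xFF])


def sum_of_ones_shift_add(word):
    """Reference version: one shift and one add per bit, for comparison."""
    count = 0
    for _ in range(32):
        count += word & 1
        word >>= 1
    return count
```

The table version replaces 32 shift-and-add steps with four reads and three additions, at the cost of 256 scratchpad entries.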
7.3.2
Experiments

The different Imagine metrics have been characterized in an effort to identify the configuration of Imagine that delivers maximum performance. The input data set used for these experiments is a mixed set of synthetic and real traces. The synthetic packet traces have been constructed for the MAE West routing table, which is from a backbone router. The three synthetic traces have been constructed in an effort to identify and observe the performance of the architecture under different input scenarios. The three traces are classified as follows:

✦ Maximum. The trace termed maximum consists of packets that hit all the levels of the SRAM table. The trace has been constructed to demonstrate the maximum execution time of the engine (i.e., minimum throughput) and is the worst-case scenario.
✦ Average. The average trace consists of packets that hit all the routing table entries at least once. This trace is randomly constructed, satisfying only the criterion that every entry in the routing table is hit at least once. It represents the average-case scenario, in between the worst case and the best case.
✦ Minimum. The minimum case identifies the best-case scenario and consists of packets that hit only one level of the SRAM table.
Varying the size of the input stream (buffer)

The results for the first set of experiments were collected by varying the size of the input stream to the forwarding kernel. This parameter determines the size of the ingress packet buffer in an actual implementation and plays an important role in the design of a network processor. The execution time metric is the ratio of the total number of cycles to the total number of packets processed; it indicates the number of cycles required to calculate the next-hop address for one packet. The overall execution time of a stream application is divided into the kernel execution time and the stream overhead time. Kernel execution time is the processing time of the forwarding kernel to generate the row and column address of the DRAM table. The stream overhead (Figure 7.7) is the latency of the sequential memory access plus the time
FIGURE 7.7: Stream overhead for forwarding. Stream overhead (cycles/packet, 0 to 50) versus size of input stream (packets, 0 to 2000) for the Maximum, Average, and Minimum traces.

FIGURE 7.8: Total execution time for forwarding. Execution time (cycles/packet, 0 to 100) versus size of input stream (32, 64, 128, 256, 512, 1024, 2048 packets) for the Maximum, Average, and Minimum traces.
spent building and sending the stream data between the SRF and the DRAM or kernel. The execution time is measured for varying input stream sizes in order to simulate the effect of the input queue size on the throughput of the forwarding engine. As shown in Figure 7.8, execution time decreases as stream size increases. Past a certain point, however, a further increase in the size of the input stream results in a slight increase in execution time. With very small input streams, the stream architecture suffers from the short stream effect, discussed earlier. As the input stream grows, the short stream effect disappears, the processing time per packet falls, and the forwarding engine approaches its maximum throughput. Beyond a certain input stream size, the throughput decreases slightly because of memory stalls due to increased DRAM operation. The throughput is not badly affected by the memory stalls because the sequential memory access for the most part is
hidden by kernel computation. The best throughput occurs when the sequential memory access is completely hidden by computation. As Figure 7.7 indicates, the stream overhead for the average case is higher than that of the other two cases. For every run of the kernel, there are a few unprocessed packets in the average case, which are not present in the maximum or minimum case; all packets sent to the kernel in the maximum and minimum traces get processed on every call of the kernel. The unprocessed packets cause the average case to make more trips to the SRF and the DRAM than the other cases, thereby increasing the stream operations. The kernel execution time dominates the processing time once the short stream effect is eliminated. The kernel execution times for the three cases differ, since the amount of processing differs in each case.
Varying the number of clusters
The second set of experiments varies the number of clusters. The number of clusters defines the processing power of the Imagine architecture and is a key parameter in the performance of the forwarding algorithm. The experiments have been conducted to determine the advantages of packing more hardware into the processor. They have been run for configurations of 8 and 16 clusters, with one input stream size. Results are shown in Figure 7.9. Increasing the number of clusters increases the number of packets that can be processed in parallel, thereby naturally increasing the throughput of the forwarding engine. The overall throughput will not increase twofold, since increasing the number of clusters decreases only the kernel execution time, not the stream overhead. The stream overhead limits the improvement realized by increasing the number of clusters.

FIGURE 7.9: Performance of doubling the number of clusters for forwarding. Execution time (cycles/packet, 0 to 80) for the Maximum, Average, and Minimum traces with 8 and 16 clusters.
The amount of improvement also depends on the input data set being sent through the forwarding engine. For the maximum trace, in the case of eight clusters, the processing time of the kernel is too large to take advantage of the software pipelining techniques at the stream level. As a result, the kernel execution is the slowest stage, and the rest of the operations wait for it to complete. By increasing the number of clusters, more packets are processed simultaneously, thereby decreasing the processing time of the kernel. Hence, both the memory operations and the kernel execution have comparable execution times and can be overlapped. The minimum case exhibits a lower performance gain because the contribution of the kernel to the total execution time is comparatively small. Hence, adding more processing power to the kernels will not considerably affect the execution time. The kernel execution time contributes only 50 percent of the total, which is considerably less than in the other two cases. The decrease in execution time for the minimum case will be only 25 percent, as opposed to the nearly 45 percent decrease in the maximum case. The average case performs slightly better than the maximum case but worse than the minimum case. The increase in performance is limited by the number of unprocessed packets left after every kernel call. In the case of 16 clusters, the number of unprocessed packets left after every kernel call will be fewer than 16, while in the case of 8 clusters, it will be fewer than 8. This increases the stream overhead, since more unprocessed packets from the previous kernel call are copied to the input stream every time the kernel is called, and the kernel is also called more often.
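The 25 percent figure for the minimum trace follows from a simple model in which doubling the clusters halves the kernel time and leaves the stream overhead unchanged. The sketch below states that model; it deliberately ignores the overlap of memory access and computation achieved by software pipelining, so it is only a first-order illustration, and the function names are hypothetical.

```python
def runtime_reduction(kernel_fraction, cluster_factor=2):
    """Fractional decrease in total time when only the kernel part scales."""
    return kernel_fraction * (1 - 1 / cluster_factor)


def implied_kernel_fraction(reduction, cluster_factor=2):
    """Kernel share of runtime that would produce a given reduction."""
    return reduction / (1 - 1 / cluster_factor)
```

With a kernel share of 50 percent the model gives a 25 percent decrease, as stated for the minimum trace. Read in the other direction, a decrease of nearly 45 percent would correspond to a kernel share of about 90 percent under the same simplifying assumptions.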
Cluster statistics

The cluster statistics are considered only for the forwarding kernel; the initialization kernel is ignored, since it is called only once and hence is of little importance in the performance analysis. The VLIW scheduler divides the forwarding kernel code into three basic blocks and maps them to the functional units in the cluster. The functional unit occupancies in each cluster are shown in Table 7.4. SP1 is the SRAM unit, used to store the SRAM table. In a real implementation, this will be a common memory accessible by every cluster, placed between the SRF and the clusters. However, due to the limitations of the simulator and for the sake of simplicity, it has been simulated as part of a cluster. The latency of an access is, however, increased (from 4 ns to 8 ns) to simulate the memory hierarchy. The utilization of the multiplier and divider is very low. The few operations executed on these units are the
TABLE 7.4: Functional unit occupancy (percent) for forwarding kernel.

Add1  Add2  Add3  Mul1  Mul2  Div   SP    SP1   Comm
58.6  62.1  60.3  13.8  19.0  12.1  63.8  24.1  19.0

TABLE 7.5: Performance of forwarding for real vs. synthetic traces.

Trace          Cycles per packet
Maximum        67.2
NCSU_Trace1    62.4
NCSU_Trace2    62.1
Average        53.48
select operations, which could be executed by the adders instead. The divider and the multiplier are not required and should be removed in order to gain area. Also, the highest utilization is of the SP unit, which forms the bottleneck for the performance. (The utilization of SP is underreported, because some of the SP1 accesses would actually be performed by SP if a separate SRAM unit were provided.) Hence, removing the divider and the multiplier and making the adders do the extra work will not considerably degrade performance.
Performance for real packet traces Two real-world traces (NCSU_Trace1 and NCSU_Trace2) were obtained from two edge routers located at NC State University. The number of cycles taken to process each packet (see Table 7.5) is close to the absolute maximum the forwarding engine can support. The statistics for the real traces fall in between the results obtained for the maximum and the average case, indicating that performance is indeed bounded by worst-case performance modeled by the synthetic trace in which every packet traverses every level of the trie.
7.3.3
IPv4 Performance Summary
The maximum sustainable throughput is defined as the maximum rate at which none of the received packet headers is dropped by the IPv4 Forwarding Engine. The Maximum trace exhibits worst-case lookup performance, because every lookup touches every level of the SRAM table. This trace recorded the highest time of 67.2 cycles per packet. Hence, the forwarding engine will not lose any packets if the packet transmission time is at least 67.2 cycles. With a 500-MHz clock, the maximum bit rate allowed on the input ports, assuming a 40-byte header, is 2.4 Gb/s.

The Network Processing Forum (NPF) has specified a benchmark for IPv4 forwarding [26]. We have considered only the processor performance during route lookup; we have not modeled media interfaces or implemented the control-plane functions measured by the NPF benchmark, and we have not implemented full IP forwarding as specified by RFC 1812 [27]. A test platform with two Intel IXP2400 processors, running at 600 MHz, achieved a throughput of 8 Gb/s on the NPF benchmark [28], compared to our 2.4 Gb/s with one 500-MHz Imagine processor.

A hardware implementation of the forwarding algorithm [23] used three pipeline stages. Assuming the same SRAM access time (8 ns) as simulated in our experiments, each stage of the pipeline operated at 64 ns for the ASIC implementation, resulting in a throughput of 5 Gb/s for 40-byte packets. The configuration of Imagine with eight clusters operates at around half the speed of the ASIC implementation (2.4 Gb/s, worst case).
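The two throughput figures quoted in this summary can be reproduced from the numbers in the text. The following is a minimal sketch; the function names are illustrative, and it assumes that each 40-byte header occupies the engine for the full worst-case cycle count and that the ASIC pipeline completes one packet per stage time.

```python
def max_line_rate_gbps(clock_hz: float, cycles_per_packet: float,
                       packet_bytes: int = 40) -> float:
    """Highest input bit rate (Gb/s) at which no packet header is dropped."""
    packets_per_second = clock_hz / cycles_per_packet
    return packets_per_second * packet_bytes * 8 / 1e9


def pipeline_rate_gbps(stage_time_ns: float, packet_bytes: int = 40) -> float:
    """Throughput of a pipeline that completes one packet per stage time."""
    return packet_bytes * 8 / stage_time_ns  # bits per ns equals Gb/s


imagine = max_line_rate_gbps(500e6, 67.2)  # worst-case (Maximum) trace
asic = pipeline_rate_gbps(64.0)            # three-stage ASIC, 64 ns per stage
print(f"Imagine: {imagine:.2f} Gb/s, ASIC: {asic:.2f} Gb/s")
```

The first value rounds to the 2.4 Gb/s quoted above, and the ratio of the two is a little under one half.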
7.4
RELATED WORK
Previous work has addressed the use of SIMD architectures for network processing. ClearSpeed’s CS301 is a highly parallel coprocessor for high-performance computing applications such as packet processing, based on the Multithreaded Array Processing (MTAP) architecture [29]. This architecture incorporates both MIMD and SIMD parallelism. Seshadri and Lipasti [2] proposed a vector processor (which is related to SIMD) for packet processing applications. An IP forwarding algorithm was used as a case study to test this architecture. The performance was obtained by using a routing cache that exploited the temporal locality of packet destinations. They also proposed the XIMD approach, which reduces the sensitivity of SIMD architectures to control-flow variations.

Examples of commercial network processors that use specialized coprocessors include offerings from EZ-Chip (NP-1), Agere (Payload Plus), and Vitesse (PRISM IQ 2000). Other processors, such as those from Lexra and PMC-Sierra, do not rely on coprocessors but obtain performance with the help of specialized instruction sets. However, all these processors follow the MIMD approach. Crowley et al. [30] investigated the use of standard on-chip parallelism techniques, such as chip multiprocessing (CMP) and simultaneous multithreading (SMT), for network applications.
Dally [31] proposed the use of stream processing architectures for network applications in a keynote speech at the first Workshop on Network Processors (NP-1). Amarasinghe et al. [32] implemented an IP routing application for the Raw processor, which can be considered a stream-oriented MIMD architecture. Also, recent programming environments for network processors, such as NP-Click [33], incorporate the notion of packet streams flowing between computational kernels. This style of programming should be well suited for a stream processing architecture.
7.5
CONCLUSIONS AND FUTURE WORK
This chapter has explored the application of stream architectures to packet processing tasks, in particular IPv4 Forwarding and AES encryption. Both applications were run on a generic stream architecture (Imagine), and experiments were conducted to characterize the performance of both applications for different configurations of this architecture. For a system clock of 500 MHz, the throughput of the AES encryption in ECB mode varies from 2.02 Gb/s (96-block packets) to 0.8 Gb/s (8-block packets). Throughput for a trace-based mix of packet sizes is 1.6 Gb/s. Using OCB mode, which also includes the calculation of a message authentication code, best-case throughput is around 1.4 Gb/s. The IPv4 Forwarding application, with a configuration of one Imagine with 8 clusters, delivered a worst-case performance of around 67 cycles per packet, for a packet trace constructed from the MAE-WEST routing table. Hence, the forwarding engine was able to support packet traffic coming at a rate of OC-48 assuming a clock frequency of 500 MHz.

Table 7.6 shows performance metrics of the Imagine architecture for AES and IPv4, compared with the Imagine metrics for two standard media applications: Depth and MPEG2 [3]. The comparative study with a media application, for which the Imagine processor was originally built, gives us a better idea of the usefulness of this architecture for packet processing applications. The table shows the bandwidth for each of the three levels of the memory hierarchy and the billions of operations per second (GOPS). The peak bandwidth and GOPS for the Imagine processor are shown in the bottom row.

Table 7.6. Performance metrics for networking (AES, IPv4) and media (Depth, MPEG2) applications. Peak values for Imagine are shown in the bottom row.
Application     MEMORY_BW (GB/s)   SRF_BW (GB/s)   LRF_BW (GB/s)   GOPS
AES             1.12               2.23            239.90          20.66
IPv4            1.69               2.9             78.41           11.04
Depth           0.83               21.08           263.02          12.1
MPEG2           0.45               2.21            214.31          18.3
Imagine Peak    2.67               32              544             40

The LRF and memory bandwidth characteristics of the two packet processing applications confirm that they are at two different ends of the application spectrum, with one being memory-intensive and the other being computation-intensive. The low SRF bandwidth of the two packet processing applications, compared to the Depth media application, is due to the fact that the processing of the packets for both applications is done primarily in one kernel. This results in fewer trips to the SRF between kernels, decreasing the SRF bandwidth utilization. On the whole, the packet processing applications have metrics comparable to those of media applications, indicating that this architecture could be as useful for network applications as it is for media applications.

The base architecture was modified in our study to enhance the performance of the network applications. The modifications include an additional SRAM (between the SRF and the clusters) in the memory hierarchy for the forwarding application and extra local memory (i.e., number of scratchpads) in each cluster for the AES encryption application. The number of functional units in each cluster was also varied to identify the optimum mix of functional units for each application. Not surprisingly, we found that the multiplier and divider units were underutilized, and we advocate replacing them with more useful structures for packet processing.

The two applications responded differently to the SIMD mode of execution employed by the Imagine processor.
The forwarding application took advantage of the synchronized processing to achieve high memory bandwidth. Load-balancing problems were eliminated by using conditional stream operations in the forwarding application. The encryption algorithm, however, did not take advantage of conditional streams to solve its problem with variable-sized packets, because we wanted to maintain packet ordering [34]. Instead, we opted to apply SIMD processing within the packet payload, with each cluster processing a separate block. We believe that our experience is inconclusive with respect to the effectiveness of the SIMD paradigm for packet processing applications; we will explore this further with additional applications in future work.

The ultimate goal of this research is to propose an architecture based on the stream programming model for packet processing applications. Future work involves implementing other networking applications on this architecture in an effort to identify a common configuration of the stream architecture suitable for all the applications in the network processing domain. A major part of this future study is a comparative study of the cost, area, performance, and power-dissipation metrics of the proposed architecture, relative to other known network processor architectures.
ACKNOWLEDGMENTS The authors thank Professor Bill Dally and members of the Stanford University Concurrent VLSI Architecture Group, especially Dr. Ujval Kapasi and Abhishek Das, for use of the Imagine toolset, and for answering many questions about Imagine and the tools.
REFERENCES [1]
N. Shah, “Understanding Network Processors,” MS thesis, University of California-Berkeley, September 2001.
[2]
M. Seshadri and M. Lipasti, “A case for vector network processors,” Proceedings of the Network Processors Conference West, October 2002, pp. 387–405.
[3]
B. Khailany, W. J. Dally, et al., “Imagine: Media processing with streams,” IEEE Micro, March/April 2001, pp. 35–46.
[4]
U. J. Kapasi, W. J. Dally, et al., “Efficient conditional operations for data-parallel architectures,” Proceedings of the 33rd International Symposium on Microarchitecture, December 2000, pp. 159–170.
[5]
U. J. Kapasi, W. J. Dally, et al., “The Imagine stream processor,” Proceedings of the IEEE International Conference on Computer Design, September 2002, pp. 282–288.
[6]
J. Daemen and V. Rijmen, “AES proposal: Rijndael,” NIST, September 1999, csrc.nist.gov/CryptoToolkit/aes/rijndael/Rindael-ammended.pdf.
[7]
J. Daemen and V. Rijmen, “A Specification for Rijndael, the AES Algorithm, V3.5,” April 12, 2001.
[8]
S. Rixner, W. J. Dally, et al., “A bandwidth-efficient architecture for media processing,” Proceedings of the 31st International Symposium on Microarchitecture, 1998, pp. 3–13.
[9]
W. Dally et al., “The Imagine Instruction Set Architecture,” February 20, 2002, cva.stanford.edu/ee482c/downloads.html.
[10]
C. S. K. Clapp, “Instruction-level parallelism in AES candidates,” Proceedings of the Second AES Candidate Conference, March 1999, pp. 68–84.
[11]
J. D. Owens, “Computer Graphics on a Stream Architecture,” Ph.D. dissertation, Stanford University, 2002.
[12]
National Laboratory for Applied Network Research, “NLANR network traffic traces,” pma.nlanr.net/Traces/Traces/mdata/AIX/PLen/20030605/1054837521-1.PLen.
[13]
National Laboratory for Applied Network Research, “NLANR MOAT PMA: AIX Site Information,” pma.nlanr.net/PMA/Sites/AIX.html.
[14]
D. Whiting, B. Schneier, and S. Bellovin, “AES Key Agility Issues in High-Speed IPsec Implementations,” May 2000, www.counterpane.com/msm.html.
[15]
Y. Lai and G. T. Byrd, “AES packet encryption on a SIMD stream processor,” in Embedded Cryptographic Hardware Methodologies and Architectures, editors: N. Nedjah and L. de M. Mourell, Nova Science, 2004, pp. 57–76.
[16]
National Institute of Standards and Technology, Recommendation for Block Cipher Modes of Operation—Methods and Techniques, NIST Special Publication 800-38A, December 2001, csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf.
[17]
Computer Security Resource Center, “ CSRC—Proposed Modes of Operation,” csrc.nist.gov/-CryptoToolkit/modes/proposedmodes/.
[18]
P. Rogaway et al., “OCB: A block-cipher mode of operation for efficient authenticated encryption,” Proceedings of the ACM Conference on Computer and Communications Security, September 2001, pp. 196–205.
[19]
T. Kohno et al., “The CWC-AES Dual-Use Mode,” Crypto Forum Research Group, Internet Draft, 2003.
[20]
K. Aoki and H. Lipmaa, “Fast implementations of AES candidates,” Proceedings of the Third AES Candidate Conference, April 2000, pp. 106–120.
[21]
H. Lipmaa, “Survey of Rijndael implementations,” www.tcs.hut.fi/∼helger/aes/nfb.html.
[22]
W. E. Burr, “Selecting the advanced encryption standard,” IEEE Security & Privacy, March/April 2003, pp. 43–52.
[23]
P. Mehrotra, “Memory Intensive Architecture for DSP and Data Communication,” Ph.D. dissertation, North Carolina State University, 2002.
[24]
J. S. Rai, “A Feasibility Study on the Application of Stream Architectures for Packet Processing Applications,” MS thesis, North Carolina State University, 2003.
[25]
U. J. Kapasi, P. Mattson, et al., “Stream Scheduling,” Concurrent VLSI Architecture Technical Report 122, Stanford University, March 2002.
[26]
P. R. Chandra, ed., “IPv4 Forwarding Application-Level Benchmark Implementation Agreement,” Revision 1.0. Network Processor Forum Benchmarking Work Group, 2002.
[27]
F. Baker, ed., “Requirements for IP Version 4 Routers,” IETF RFC 1812, June 1995.
[28]
D. Ming, E. Eduri, and M. Castelino, eds., “IXP2400 Intel Network Processor IPv4 Forwarding Benchmark Full Disclosure Report for Gigabit Ethernet, Revision 1.0,” Network Processing Forum Benchmarking Work Group. March 5, 2003.
[29]
ClearSpeed Technology, “ClearSpeed CS301 Processor: An Advanced Multi-Threaded Array Processor for High-Performance Compute,” www.clearspeed.com/downloads/overview_cs301.pdf.
[30]
P. Crowley, M. E. Fiuczynski, J. B. and B. N. Bershad, “Characterizing processor architectures for programmable network interfaces,” Proceedings of the 14th International Conference on Supercomputing, May 2000, pp. 54–65.
[31]
W. Dally, Keynote address at the HPCA8 Workshop on Network Processors (NP-1), February 2002.
[32]
S. Amarasinghe, G. Chuvpilo, and D. Wentzlaff, “Gigabit IP routing on raw,” Proceedings of the HPCA8 Workshop on Network Processors (NP-1), February 2002, pp. 2–9.
[33]
N. Shah, W. Plishker, and K. Keutzer, “NP-Click: A programming model for the Intel IXP1200,” Proceedings of the HPCA9 Workshop on Network Processors (NP-2), February 2003, pp. 100–111.
[34]
W. Bux et al., “Technologies and building blocks for fast packet forwarding,” IEEE Communications, January 2001, pp. 70–77.
8 CHAPTER
A Programming Environment for Packet-Processing Systems: Design Considerations
Harrick Vin, Jayaram Mudigonda
Department of Computer Sciences, University of Texas at Austin
Jamie Jason, Erik J. Johnson, Roy Ju, Aaron Kunze
Intel Research and Development
Ruiqi Lian
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, PRC
The design of packet-processing systems is required to meet two, often-conflicting, requirements: (1) support a large number of high-bandwidth links, and hence large system throughputs, and (2) offer a wide range of services as varied as conventional forwarding functions, VPN, intrusion detection, differentiated services, and overlay network processing. To simultaneously meet these requirements, a new breed of processors, referred to as network processors (NPs), has emerged [1–6]. Network processors, much like general-purpose processors, are programmable and include mechanisms (such as multiple processor cores per chip and multiple hardware contexts per processor core) that enable them to process packets at high rates. These mechanisms foreshadow a more general trend toward the design of multicore, multithreaded architectures targeted for high-throughput computing environments. NPs, when combined with general-purpose processors, fixed-function coprocessors, and other reconfigurable logic elements, create a powerful hardware platform for designing packet-processing systems. Unfortunately, the methodologies needed to map packet-processing applications onto NP-based packet-processing systems are unfamiliar to many programmers, and the tools to perform such mappings are in their infancy. Today, programmers are often required to manually partition each network application into components at design time and map these components onto different types
and instances of processors within a packet-processing system. The different types of processors available within a packet-processing system (and even within an NP) are often programmed using separate programming environments. In most cases, general-purpose processor cores included in packet-processing systems are programmed using conventional programming languages (e.g., C), compilers (e.g., the GNU C compiler), and operating systems (e.g., Linux). Programming environments for special purpose cores in NPs, on the other hand, typically have their own tools and development methodologies that usually expose hardware details within the programming language and provide limited, if any, runtime systems for managing resources. The programmer, in turn, is required to develop hand-tuned code that carefully manages a variety of resources while ensuring that the system can process incoming traffic at the line rate. Such hand-tuned, resource-mapping decisions are made at design time and are based on the performance expectations of the application, expected workload, and exact hardware configuration of the system. Consequently, when an application is ported from one platform to another, the performance rarely scales as expected due to mismatches between the mappings, workloads, and the new hardware. Even recent attempts at complete programming environments for NPs expose most of the hardware details to the programmer and involve the programmer in mapping applications to packet-processing system resources [7, 8]. We predict that future generations of packet-processing systems will support a larger number and more types of processors, more diverse memory hierarchies, and more complex processor interconnects; hence, the difficulty in programming packet-processing systems will only increase over time. 
In this chapter, we describe our vision and design of a programming environment aimed at making future generations of packet-processing systems as easily programmable as today’s workstations and servers. Our programming environment consists of: (1) a domain-specific programming language for specifying packet-processing applications, (2) a compiler that incorporates profile-guided techniques for mapping packet-processing applications onto complex packet-processing system architectures, and (3) a runtime system that dynamically adapts resource allocations to create systems that are more robust against attacks, achieve higher performance, and consume less power than current systems on similar hardware. The resulting environment facilitates rapid development of portable, high-performance packet-processing applications on programmable packet-processing systems. The rest of the chapter is organized as follows. In Section 8.1, we expose the characteristics of packet-processing applications and describe some of
the common architectural features found in most NPs and packet-processing systems. In Section 8.2, we describe the overall design of our programming environment, with Section 8.3 providing details of each component of our design, focusing particularly on the key research challenges and requirements. Finally, Section 8.4 summarizes our contributions and current status.
8.1
PROBLEM DOMAIN Our objective is to design a programming environment that facilitates rapid development of portable, high-performance packet-processing applications on multicore, lightweight threaded packet-processing systems in general and current NP-based systems in particular. In what follows, we first expose the characteristics of packet-processing applications and describe some of the common architectural features found in most NPs and packet-processing systems. We argue that NP and packet-processing system architectures are qualitatively different from general-purpose systems and have a more complex processor structure. Further, packet-processing applications are structurally different from other applications where high performance (throughput) is required.
8.1.1
Packet-Processing Applications
Packet-processing applications receive, process, and transmit data units, generally referred to as packets or cells.1 To understand the requirements these applications impose on system architectures and programming environments, we have analyzed several packet-processing applications and derived their fundamental characteristics:

✦ Packet-processing applications can be described using cyclic data-flow graphs [9]. Most of these graphs possess the following characteristics:
  - Many packet-processing functions (i.e., data-flow actors) are stateful: they maintain per-flow state. The state is accessed and updated while processing packets belonging to the flow. The statefulness of functions, when combined with the burstiness of traffic, means that the number of packets belonging to different flows and of different types processed by the application can vary greatly over time; hence, the system does not reach a steady state of operation.
  - The execution of processing functions in the data-flow graph is triggered by packet arrivals, timers, or other hardware events (e.g., link failure). The sequence of functions executed for a packet depends on the packet’s type (determined based on its header and content) and the state of the system.
  - Packet-processing functions can be represented as M-in, N-out data-flow actors. On processing a packet, some functions generate multiple output packets (e.g., multicast routing); some functions process multiple packets and generate one output packet (e.g., IP defragmentation); while some other functions generate output packets (e.g., keep alive) without consuming any packets.
✦ In most applications, there is little or no dependence between packets belonging to different flows.2 Hence, packet-processing applications exhibit a high degree of parallelism while processing packets from different flows. However, within flows, many packet-processing applications require (because of shared flow state or other ordering constraints) packets belonging to a flow to be processed and transmitted in a particular order.
✦ Throughput, as opposed to delay, is the primary performance metric for many packet-processing applications. Further, individual functions in most packet-processing applications are often not compute-intensive; their performance is dominated by memory-access latencies (resulting from operations such as compression table lookups and routing table lookups). Emerging packet-processing applications often make hundreds of memory accesses per packet, and hiding those memory access latencies to achieve better throughput is paramount.

1. In this chapter, the term packet is used to refer to any unit of network data transfer, including, but not limited to, datagram, frame, cell, and packet.
2. A flow is a sequence of related packets (generally transmitted between two application endpoints).

8.1.2
Network Processor and System Architectures
To exploit the inherent flow-level parallelism present in this class of applications, NPs support multiple cores for processing packets. Many NPs also include general-purpose processors to support infrequent but complex operations, and special-purpose units to perform operations such as encryption and checksum computation. To hide memory access latencies, NPs often support hardware multithreading. With hardware multithreading, when a thread blocks, on a memory access for example, another thread within the same processing unit takes over and processes a different packet. Further, to reduce the memory access latencies and contention (and thereby limit the number of hardware threads needed to utilize processor cores fully), NPs often support a multilevel memory hierarchy.

To enable multiple processor cores to communicate effectively, NPs typically contain at least one, but usually many, forms of interprocessor communication (IPC) mechanisms. For example, interthread signaling mechanisms are generally fast, but convey only a coarse granularity of information, whereas atomic shared memory operations can be quite fine grained, but have longer access latencies. NPs often also provide hardware support for rings, queues, and blocking and nonblocking forms of synchronization. NP architectures from multiple vendors instantiate these architectural features [1–6].

Depending on the types of applications to be supported, a packet-processing system may combine one or more NPs, general-purpose processors, coprocessors, and programmable logic elements. Thus, from a programmer’s perspective, NPs and packet-processing systems contain processors with heterogeneous capabilities: some processors may be fully programmable (e.g., the microengines and the Intel XScale® processors in the Intel IXP2800 network processor), some may only support configuration (e.g., a classification coprocessor), while some others may be fixed-function (e.g., a hash/crypto unit). Processors may communicate with each other using several different communication mechanisms (e.g., shared memory and message passing) with different bandwidths and latencies of communication. Finally, the memory system may support specialized functional capabilities (e.g., atomic increment/decrement operations, ternary CAM).
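The relationship between memory latency and the number of hardware threads needed to keep a core busy can be made concrete with a back-of-the-envelope model. The sketch below is an illustration under stated assumptions, not a description of any particular NP: each thread is assumed to alternate a fixed amount of computation with one blocking memory access, context switches are assumed to be free, and the function names and cycle counts are hypothetical.

```python
import math


def threads_to_hide_latency(compute_cycles: int, memory_latency_cycles: int) -> int:
    """Smallest number of hardware threads that keeps one core fully busy."""
    if compute_cycles <= 0 or memory_latency_cycles < 0:
        raise ValueError("compute_cycles must be > 0 and latency >= 0")
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)


def core_utilization(threads: int, compute_cycles: int,
                     memory_latency_cycles: int) -> float:
    """Fraction of cycles in which some thread is doing useful work."""
    period = compute_cycles + memory_latency_cycles
    return min(1.0, threads * compute_cycles / period)


# Hypothetical numbers: 20 cycles of work per memory access, 100-cycle latency.
needed = threads_to_hide_latency(20, 100)
print(needed, core_utilization(4, 20, 100), core_utilization(needed, 20, 100))
```

In this model, serving the same access from a faster level of the memory hierarchy (say 40 cycles instead of 100) reduces the number of threads required, which is the effect of the multilevel memory hierarchy noted above.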
8.1.3
Solution Requirements
The unique characteristics of packet-processing applications, when combined with the sophistication of NP hardware, have resulted in a lack of good tools to automatically map such applications onto NPs. Today, application designers map applications to hardware manually at design time. The complexity of such mappings and their variability across applications, architectures, and workloads makes programming NPs complex, tedious, and error-prone. Furthermore, when applications developed for one platform are ported to other platforms the performance does not scale as expected, and packet-processing systems so designed cannot adapt to dynamic changes in traffic conditions. These observations lead to five requirements for designing a programming environment for packet-processing systems:

1. The programming environment should abstract away architectural details from the programmers. It should allow programmers to specify entire packet-processing applications (control- and data-path functions) without partitioning them across different types of processors.
2. The programming environment should automate the allocation of processor, memory, and communication resources available in packet-processing systems to packet-processing functions.
3. The programming environment should support mechanisms for dynamic adaptation of resources allocated to applications. These mechanisms should allow packet-processing systems to provide performance guarantees to flows under fluctuating traffic conditions as well as ensure robustness of the system under attacks.
4. The programming environment should enable packet-processing systems to achieve performance comparable to hand-tuned systems.
5. The programming environment itself should be retargetable and extensible.
In what follows, we describe our approach to designing a programming environment for packet-processing systems.
8.2
SHANGRI-LA: A PROGRAMMING ENVIRONMENT FOR PACKET-PROCESSING SYSTEMS
We apply the classic divide-and-conquer methodology of system design to develop a programming environment that meets our requirements. The architecture of our overall system, referred to as Shangri-La, is illustrated in Figure 8.1.

[Figure 8.1: Shangri-La system architecture. Components shown: Debugger, Baker, Profiler, Pipeline (π) compiler, Aggregate compiler, Run-time system, System model, Network system hardware.]

To facilitate the development of packet-processing applications, Shangri-La defines Baker, a modular, domain-specific programming language that allows a network application to be specified as a composition of packet-processing functions (PPFs). Baker allows unified specification of entire packet-processing applications (control and data path). A Baker application is compiled to an abstract machine [captured by an Intermediate Representation (IR)], and is fed into the compilation system of Shangri-La.

The compilation system of Shangri-La consists of three components: the profiler, the pipeline compiler (or π-compiler), and the aggregate compiler. The profiler (1) derives code and data structure profiles (such as the locality properties of data structures, the frequencies of execution of different PPFs, and the amount of communication between each pair of PPFs) by emulating the execution of the network application under representative traffic conditions; and (2) annotates the IR with the resulting profiles. It is important to note that the profiler emulates the abstract machine without any knowledge of the target packet-processing system architecture; hence, it can derive only a functional profile, and not a cycle-accurate performance profile for the applications. The annotated code is subsequently fed to the π-compiler.

The π-compiler addresses the question: How should the application (code and data structures) be organized to use the available packet-processing system resources effectively? First, it uses the data structure profiles to derive a strategy for explicit management of the memory hierarchy. This may involve mapping or prefetching data structures to specific levels of the memory hierarchy, as well as identification of data structures that should be managed using different memory management policies (e.g., selecting different caching policies for different data structures). Second, the π-compiler clusters PPFs into appropriate pipeline stages (referred to as aggregates), determines the allocation of resources for each aggregate (e.g., whether or not an aggregate needs to be replicated), and derives an initial mapping of aggregates onto the multiple processors available in the packet-processing system.

The aggregate compiler uses the aggregate definitions and memory mappings from the π-compiler and produces an optimized binary for each of the target processor types available in the packet-processing system. The aggregate
compiler performs traditional compiler tasks, machine-dependent optimizations, and domain-specific transformations.

Finally, the runtime system (RTS) includes facilities to load and execute aggregates; monitor traffic fluctuations, system performance, and power consumption; and adapt the resources allocated to aggregates such that application performance and energy consumption targets can be met even in the presence of dynamic fluctuations in traffic conditions. Additionally, the runtime system includes a resource abstraction layer that exports high-level interfaces for a wide range of hardware resources, while hiding the details of their implementation (e.g., exporting the communication channel as an abstraction while hiding the details of how the channel is realized on the hardware). The runtime system binds the interface to an appropriate implementation at load time or runtime. This facilitates dynamic adaptation of resource allocations to aggregates without requiring a recompilation.

Observe that architectural properties (such as the number and types of processing cores and the properties of different levels of the memory hierarchy) need to be exposed to the π-compiler, the aggregate compiler, and the runtime system. Shangri-La captures these architectural details in a system model, thus allowing the π-compiler, the aggregate compiler, and the runtime system to be parameterized for different packet-processing system hardware configurations by altering the model.
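The idea of a resource abstraction layer that binds a channel interface to an implementation at load time can be sketched in a few lines. The following Python sketch is only an analogy under stated assumptions: the class names, the two implementations, and the placement rule in bind_channel are hypothetical and are not taken from Shangri-La, and both implementations are stand-ins backed by an in-memory queue.

```python
from abc import ABC, abstractmethod
from collections import deque


class Channel(ABC):
    """Hypothetical interface that aggregates are compiled against."""

    @abstractmethod
    def put(self, packet: bytes) -> None: ...

    @abstractmethod
    def get(self) -> bytes: ...


class _QueueBackedChannel(Channel):
    """Shared stand-in behavior: a FIFO queue held in memory."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def put(self, packet: bytes) -> None:
        self._items.append(packet)

    def get(self) -> bytes:
        return self._items.popleft()


class LocalQueue(_QueueBackedChannel):
    """Stand-in for a channel between aggregates on the same core."""


class SharedMemoryRing(_QueueBackedChannel):
    """Stand-in for a ring in shared memory between different cores."""


def bind_channel(producer_core: int, consumer_core: int) -> Channel:
    """Choose an implementation from where the two endpoints are placed."""
    if producer_core == consumer_core:
        return LocalQueue()
    return SharedMemoryRing()


# The aggregate code uses only put/get; moving an aggregate to another core
# changes the binding, not the aggregate.
channel = bind_channel(producer_core=0, consumer_core=1)
channel.put(b"packet-0")
print(type(channel).__name__, channel.get())
```

Because the caller depends only on the interface, the binding can change when an aggregate is remapped, which mirrors the stated goal of adapting resource allocations without recompilation.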
8.3 DESIGN DETAILS AND CHALLENGES

Each of the three major components of the Shangri-La architecture (language, compiler, and runtime system) poses its own set of research challenges and key design considerations. The following sections describe these challenges along with the key design considerations and tradeoffs we have made for Shangri-La.
8.3.1 Baker: A Domain-Specific Programming Language

A language in Shangri-La must play two roles: (1) as the interface to the programmer for expressing the application semantics, and (2) as an interface to the compiler to enable the generation of efficient executables on the target system. While the first role of a language is undeniably important, for the Shangri-La architecture and the Baker language, it is the second role that is crucial. The design of Baker, thus, has primarily focused on the second role by asking the question: What does the Shangri-La compiler complex require to perform its
task of automated mapping and efficient code generation? Secondary to this question, we have tried to answer the question: How does the programmer efficiently express the application? Fortunately, in most instances, both questions can be satisfied by creating domain-specific constructs from which the compiler can infer concurrency information and using which the programmer can easily express application semantics.
Baker language features

The Baker language is syntactically similar to C, but with data-flow and other packet-processing domain extensions. Like all data-flow languages, Baker has actors, called packet-processing functions (PPFs), inter-actor communication conduits, called channels,3 and data appropriate for use on channels, called packets. These constructs are illustrated in Figure 8.2 and the code segment in Figure 8.3, and are explained in more detail next. An Ethernet bridging and IPv4 forwarding application (an “L3 switch”) could contain PPFs for route lookup and bridging. It also contains configuration and management code that adds and removes interface information from an interface table (e.g., up, down, MAC addresses, etc.).
FIGURE 8.2: Baker data-flow constructs. [Diagram: module L3Switch containing configuration and management code, data, and the L2 Cls, Bridge, and LPM PPFs. Elements are labeled (A) through (G); the channel labels (A), (B), (D), (E1), (E2), (F1), and (F2) match the wiring comments in Figure 8.3.]

3. Or communications channels.
     1. // "Protocol" definition
     2. protocol Ethernet {
     3.   dest : 48;
     4.   src : 48;
     5.   len_type : 16;
     6.   snap : anyof {
     7.     { len_type > 1500 }: 0;
     8.     default: LLCSNAP; // Not shown
     9.   }
    10.   demux { (len_type<1500) ? 22*8 : 14*8 };
    11. };
    12. // L2 Classifier process function
    13. void L2Cls.process(Ethernet_packet_t* p) {
    14.   IPv4_packet_t* ip;
    15.   if (p->len_type < 1500) processSnap(p);
    16.   else
    17.     if (p->len_type == 0x800 &&
    18.         isInIfaceList(p->dest)) {
    19.       ip = packet_decap(p); //Uses demux value
    20.       channel_put(forward_chnl, ip);
    21.     } else
    22.       channel_put(bridge_chnl, p);
    23. }
    24. module L3Switch {
    25.   ppf L2Cls;  //forward declarations of
    26.   ppf Bridge; //required PPFs
    27.   ppf LPM;
    28.   channels {
    29.     input Ethernet_packet_t input_chnl; //(A)
    30.     output packet_t output_chnl; //(B)
    31.   }
    32.   wiring {
    33.     //equate this module’s external channels
    34.     input_chnl = L2Cls.input_chnl; //(A=D)
    35.     output_chnl = Bridge.output_chnl; //(B=F1)
    36.     output_chnl = LPM.output_chnl; //(B=F2)
    37.     //bind internal PPFs channel endpoints
    38.     L2Cls.bridge_chnl -> Bridge.input_chnl; //(E1)
    39.     L2Cls.forward_chnl -> LPM.input_chnl; //(E2)
    40.   }
    41.   //module’s data
    42.   iface_table_t iface_tbl;
    43.   //module’s interface
    44.   void add_interface(iface_t r);
    45.   void del_interface(iface_t r);
    46. };

FIGURE 8.3: Sample Baker code.
PPFs contain the actual packet-processing code in an application. This code is expressed in much the same way a C function is expressed (lines 13–23). The inputs to the function represent packets from the input channel endpoints of the PPF. The function explicitly places packets on its output channel endpoints (lines 20, 22), and the data structures accessed are those available within the scope of the PPF’s compilation unit. Currently, Baker allows only a single instance of any PPF in an application. One interesting topic for further exploration is how to instantiate PPFs without sacrificing the efficiency with which PPFs access per-instance data structures.

Channels carry data, typically packets, between the output and input channel endpoints of PPFs. Channels represent a wiring of the data-flow through the PPFs of an application. PPFs can have passive input channel endpoints, in which data arrival implicitly invokes the processing code of the PPF (e.g., although not shown, L2Cls.input_chnl is passive and so the L2Cls.process function is invoked implicitly). PPFs can also have active inputs, in which retrieval of data from the channel is explicit in the packet-processing code. Active inputs enable the programmer to express PPFs such as a scheduler in a quality-of-service application.
In Baker, packets are declared to be of a new, first-class data type, packet_t. Packets can be accessed through a protocol specification and associated with meta-data, as follows:

✦ Protocol specifications, which are written by Baker programmers, enable packets to be viewed according to a particular layout (lines 1–11). For example, a bridging PPF may choose to view a packet through an Ethernet protocol, whereas a routing operation may choose to view the same packet through an IPv4 protocol. Protocol specifications insulate the programmer from packet memory layout and alignment issues; the compiler and RTS may decide to represent packets in the most appropriate manner for the target hardware (for example, as a contiguous block of memory or as a chain of buffers) without the Baker programmer knowing.

✦ Meta-data is used to convey per-packet information, such as input and output port, and flow identifiers, between PPFs. Meta-data is user-defined in Baker, is created by one PPF and consumed by another, and is carried with a packet through the channels of Baker.
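To make the idea of viewing a packet through a protocol concrete, the following Python sketch mimics the Ethernet specification of Figure 8.3. It is an illustration only and assumes a contiguous byte buffer, which is exactly the detail Baker hides from the programmer. The function names `ethernet_view` and `decap` and the `Packet` class are hypothetical; the field widths and the demux expression are copied from lines 3–10 of the figure.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Packet:
    """Hypothetical stand-in for packet_t: raw bytes plus user-defined meta-data."""
    data: bytes
    meta: dict = field(default_factory=dict)


def ethernet_view(pkt):
    """Read the fields declared in the Ethernet protocol of Figure 8.3."""
    dest = pkt.data[0:6]                                  # dest : 48 bits
    src = pkt.data[6:12]                                  # src  : 48 bits
    (len_type,) = struct.unpack("!H", pkt.data[12:14])    # len_type : 16 bits
    # demux expression from line 10 of the figure, in bits
    header_bits = 22 * 8 if len_type < 1500 else 14 * 8
    return {"dest": dest, "src": src, "len_type": len_type,
            "header_bits": header_bits}


def decap(pkt):
    """Strip the Ethernet header using the demux value; meta-data travels along."""
    header_bytes = ethernet_view(pkt)["header_bits"] // 8
    return Packet(pkt.data[header_bytes:], dict(pkt.meta))
```

A PPF written against `ethernet_view` never computes offsets itself, so a different buffer layout would only require changing the view, not the processing code.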
We are currently exploring how Baker can specify flows of packets. Our current approach uses meta-data to define flows such that packets within the same flow can then be ordered and serialized according to the specification of the programmer. For example, all of the packets entering an order-sensitive PPF could be ordered according to a flow-ID-based piece of meta-data as well as a monotonically increasing sequence number (e.g., a received packet number). However, more work remains as this is a key area for a packet-processing language. Finally, although not strictly part of a data-flow model, Baker defines modules (lines 24–46), which represent a namespace for other modules, PPFs (lines 12–23), channels (lines 28–31), shared data (line 42), and configuration and management code (lines 44–45). Modules may also contain input and output channels of their own (e.g., A, B) that are wired to the input and output channels of their contained PPFs (e.g., A wired to D). Constructs such as module and configuration code are necessary for the expression of a complete packet-processing application (including control-plane processing), as well as being convenient for the organization of a programmer’s code. Similar to PPFs, only one instance of a module can currently exist in a Baker application, and an interesting area of further research is how to instantiate modules efficiently.
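One way to picture flow-based ordering is a resequencer placed in front of an order-sensitive PPF. The sketch below is an assumption-laden illustration, not Baker's mechanism: the class name `FlowResequencer` is hypothetical, and it assumes that each flow carries its own dense sequence numbers starting at 0, which is stricter than the received-packet number mentioned above.

```python
import heapq
from collections import defaultdict


class FlowResequencer:
    """Release packets of each flow in sequence order (illustrative only)."""

    def __init__(self):
        self._next_seq = defaultdict(int)    # next expected number per flow
        self._pending = defaultdict(list)    # per-flow min-heap of (seq, packet)

    def push(self, flow_id, seq, packet):
        """Accept one packet and return the packets that are now deliverable."""
        heap = self._pending[flow_id]
        heapq.heappush(heap, (seq, packet))
        released = []
        # Drain the heap while its smallest entry is the one we are waiting for.
        while heap and heap[0][0] == self._next_seq[flow_id]:
            released.append(heapq.heappop(heap)[1])
            self._next_seq[flow_id] += 1
        return released
```

Packets from different flows never wait on each other, so replicated PPF instances can still run in parallel across flows while order is preserved within each flow.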
Using Baker language features

Just saying Baker is a data-flow language is not sufficient for the programmer to properly understand how the compiler will extract the parallelism of the application. Not only must PPFs and channels be able to express the packet-processing application characteristics stated earlier (e.g., statefulness, flow specifications, etc.), but the programmer must also decompose the application into these constructs (which may be a nontrivial task in itself) while understanding how this decomposition may affect the final performance and correctness of the generated code. To this end, Baker defines an implicit threading model. A programmer must assume that any PPF may be replicated, and hence execute concurrently on multiple threads. The programmer cannot create or destroy threads, however. Instead, the programmer must assume an implicit threading model as illustrated in Figure 8.4. In this threading model, each input can conceptually be thought of as an independent thread. In this sense, channels are akin to queues; however, Baker does not strictly enforce this property of PPF channels. Instead, while the programmer thinks of channels as queues, the compiler may implement channels through either function calls or queues (but must, of course, still preserve the queue-like semantics). As we describe in the following section, this nonstrict definition of channels is important for the compiler to maximize throughput.

FIGURE 8.4: Baker’s implicit threading model and passive and active inputs. [Diagram: chains of PPFs between RX and TX, annotated with “Potential thread life” and “One or more identical threads”; the legend distinguishes passive inputs from active inputs.]
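The nonstrict channel definition can be illustrated with two interchangeable realizations of the same channel. This sketch is not generated Shangri-La code; `CallChannel`, `QueueChannel`, and `run_producer` are hypothetical names, and the single-threaded setting is a simplification.

```python
from collections import deque


class CallChannel:
    """Channel realized as a direct function call into the consumer PPF."""

    def __init__(self, consumer):
        self._consumer = consumer

    def put(self, packet):
        self._consumer(packet)       # consumer runs immediately, same thread

    def drain(self):
        pass                         # nothing is ever buffered


class QueueChannel:
    """Channel realized as a queue drained later, e.g. by another thread."""

    def __init__(self, consumer):
        self._consumer = consumer
        self._queue = deque()

    def put(self, packet):
        self._queue.append(packet)

    def drain(self):
        while self._queue:
            self._consumer(self._queue.popleft())


def run_producer(channel_cls, packets):
    """Feed packets through one channel and record what the consumer sees."""
    seen = []
    channel = channel_cls(seen.append)
    for packet in packets:
        channel.put(packet)
    channel.drain()
    return seen
```

Both realizations deliver packets to the consumer in the order they were put, which is the queue-like behavior the programmer relies on; only the point in time at which the consumer runs differs.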
Implications of Baker language features: A compiler’s perspective

In order to generate code that can achieve high throughput on an NP-style parallel system architecture, a compiler must be able to derive and exploit the inherent parallelism in the application’s code and data. Baker enables the Shangri-La compiler complex to understand those functions that
are independent (i.e., PPFs), as well as the data on which those functions can be replicated (i.e., packets and their ordering constraints). However, while Baker does enable the compiler to extract the inherent parallelism of the application, it is equally important that Baker does not enforce any constraints that could limit the compiler’s ability to exploit such parallelism. The implicit threading model and nonstrict definitions of channel implementations of Baker are examples of where we made a conscious decision in the language to not restrict the choices of the compiler and runtime system. These two language features enable the compiler to decide exactly how to map the PPFs to processing cores, which is important because this mapping may depend on workload and the hardware architecture as well as the application itself. For example, the workload may change the locality properties of data, which affects where queues should be placed in the processing pipeline—something possible because of the nonstrict definitions of channel implementations in Baker. As for hardware architecture considerations, the exact number of threads within a processor dictates the relative compute-to-I/O ratios of code that should be executed on those processors, and hence the implicit threading model of Baker enables the compiler to control exactly onto how many threads a PPF is replicated. While no one existing language meets our requirements for Shangri-La and Baker, some parts of existing languages contain useful concepts that we have borrowed. The most notable of these is the data-flow concepts from Click [9]. 
However, while Click, as well as languages for other extensible router frameworks—Genesis [10], NetScript [11], NetBind [12], VERA [13], Scout [14], Router Plugins [15, 16], and PromethOS [17]—support creation of network applications through composition of modular components [18], most of these languages utilize C/C++ or other general-purpose programming languages for developing the modules; hence, it is difficult for the compiler to extract the concurrency information for efficient mapping of applications onto packet-processing system architectures. In many of these systems, the mapping of components onto hardware resources is performed by hand, or the hardware platform assumed is uniprocessor, or the languages restrict the choice of mappings so as not to account for workload or hardware architecture variations. One solution to extracting parallelism from general-purpose languages is through language extensions such as OpenMP and related work [19]. Although these solutions ease the burden of extracting parallelism in programs, they tend to introduce too much overhead (e.g., explicit fork/join) or don’t lend themselves to the type of parallelism inherent in packet processing (i.e., pipelined functions as opposed to loops and vector operations). Finally, languages, such as microC from Intel [5] or picocode from IBM [3], expose hardware details to the programmer, and hence fail to meet our basic
requirements of portability. Similarly, languages such as Network Classification Language (NCL) [5] and Functional Programming Language (FPL) [20] offer only limited expressibility; programs expressed in these languages do not completely describe all of the packet-processing operations.
8.3.2 Profile-Guided, Automated Mapping Compiler

The compiler complex of Shangri-La consists of the profiler, π-compiler, and aggregate compiler.
Profiler

In the Shangri-La architecture, the runtime characteristics of a network application (such as the locality properties of data structures, frequencies of executions for different PPFs, the amount of communication between each pair of PPFs, etc.) are used to guide the allocation of processor, communication, and memory resources of packet-processing systems to applications. Such profile-driven compiler optimizations are not new; code layout, for example, has previously been improved through profile-guided optimizations. However, in most previous profile-guided optimizations, the profile data has been derived by first compiling the code without profile information, then executing this code with instrumentation to gather the profile information, and finally recompiling the code with the newly gathered profile information. In the Shangri-La architecture, we do not believe this approach to be feasible because it requires a reasonable first compilation and mapping of the application without profile information. In addition, collecting profile information in hardware requires intrusive instrumentation of the code, or is restricted to those statistics available through hardware-based performance monitoring units.

Instead, the Shangri-La profiler derives code and data structure profiles by emulating the execution of the network application using the IR produced by the Baker parser. In addition to the IR produced from the Baker language, to profile the runtime characteristics of a network application, the profiler needs two further inputs: (1) application state, that is, any persistent state that the network application uses to determine actions performed on packets (for instance, a route table, flow-classification data structures, and any per-flow state); and (2) a packet trace that identifies a representative mix and arrival pattern for packets at the target packet-processing system.
The profiler derives statistics for the properties of interest through a functional emulation of the application under these representative conditions. Because the profiler emulates the abstract machine with sample packet traces, the compilation time is certainly increased but this cost
is expected to be justified with a gain in runtime performance. Examples of profiling information include execution frequencies of code sequences, amount of data communicated through channels, access frequencies of data objects, etc. This information can guide a variety of program transformations and code optimizations. For example, execution frequency can influence code layout, and memory access frequency can help determine the layout of data objects across the levels of memory hierarchy. Given that the profiler is invoked at an early stage of compilation, the abstract machine emulates based on the programming model at the source language level and does not assume much knowledge of target processors. Therefore, the abstract machine is not expected to provide accurate performance information. Although conceptually straightforward, the design and implementation of the profiler poses one primary challenge: scalability. Any limits in the scalability of the profiler are due to at least two factors: (1) complexity of functional simulation, and (2) difficulty in dealing with packet traces. To control the complexity of the simulation environment requires a parameterized simulator, wherein the level of detail can be refined selectively and progressively. Further, since the refinements may depend on the traffic, the profiler needs to be self-tuning. We are exploring the design of a profiler that allows controlled, progressive refinement of the profile studies. While designing the profiler, a key challenge will be to identify specific properties of interest and then derive appropriate sampling techniques that can reduce the profiling complexity considerably.
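The kind of functional profile described here can be sketched for the L3 switch of Figure 8.3. The code below is a toy stand-in, not the Shangri-La profiler: it hard-codes the branch structure of L2Cls.process instead of interpreting compiler IR, and the trace record fields (`dest`, `len_type`, `size`) and counter names are assumptions made for the example.

```python
from collections import Counter


def profile(trace, local_macs):
    """Functionally emulate L2Cls over a packet trace and gather counts.

    trace      -- iterable of dicts with 'dest', 'len_type', 'size' (assumed format)
    local_macs -- application state: MAC addresses in the interface table
    Returns (PPF execution frequencies, bytes carried per channel).
    """
    ppf_runs = Counter()
    channel_bytes = Counter()
    for pkt in trace:
        ppf_runs["L2Cls"] += 1
        if pkt["len_type"] < 1500:
            ppf_runs["processSnap"] += 1          # SNAP path, not modeled further
        elif pkt["len_type"] == 0x800 and pkt["dest"] in local_macs:
            channel_bytes["forward_chnl"] += pkt["size"]
            ppf_runs["LPM"] += 1
        else:
            channel_bytes["bridge_chnl"] += pkt["size"]
            ppf_runs["Bridge"] += 1
    return ppf_runs, channel_bytes
```

The result depends on both the trace and the application state, and it carries no timing information, which matches the statement that the profile is functional and not an accurate performance estimate.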
Pipeline compiler

The pipeline compiler (π-compiler) partitions a packet-processing application into a series of tasks (called aggregates), which form the processing stages in a pipeline. On IXP-based NPs, for example, these pipeline stages can be mapped to multiple chained microengines (MEs) as well as the Intel XScale® processor. The π-compiler has two primary functions: (1) it manages the memory hierarchy to minimize average memory access times; and (2) it groups packet-processing functions into aggregates such that these aggregates, when mapped onto the multiple processor cores, can maximize the overall throughput.

While the π-compiler derives aggregates, it is important to have a well-engineered cost model to consciously guide each aggregation step. The cost model includes factors such as the cost of communication, synchronization overhead, memory access latencies, CPU execution times, and code size. Although it may sound appealing to simply minimize the processing time of the dominant stage in the partitioned tasks to maximize the rate of packet processing in the pipeline, this tends to split the PPFs of an application into too many aggregates,
increasing communication cost and the number of processor cores allocated in the pipeline. Since the pipelined tasks can be replicated as multiple threads on one or more MEs to process multiple packets in parallel, it is important to balance the rate of pipelined tasks (i.e., the number of packets processed in the pipeline within a given time) and the available amount of parallelism to concurrently process the packets in replicated pipelines. The ultimate objective is to maximize the number of packets processed in the complete system within a given period of time. There is a large body of parallel programming research on designing algorithms for mapping computation onto multiprocessors [21–27]. The research can be broadly classified into two categories. The first category of research focuses on the problem of mapping parallel (data- and task-parallel) computations on multiprocessors [21, 22, 24, 27]. Most of these techniques derive a mapping such that the execution time of a single instance of the program is minimized. For packet-processing applications, on the other hand, the optimization criterion is maximization of average- or worst-case packet-processing throughput. The second category of research addresses the problem of mapping pipelined computations (e.g., streaming and DSP applications) onto multiprocessors [28–30]. This work is more closely related to the problem at hand. However, most of the prior work makes assumptions that all the units of work go through a single sequence of pipeline stages, and pipeline stages are performing computationally intensive tasks (hence, when two pipeline stages are fused to create a new pipeline stage, its execution time requirement can be estimated simply as the sum of the execution time requirement of the component stages). These assumptions do not hold for packet-processing applications. 
As we have argued earlier, at any instant, a packet-processing system may process multiple packets, each of which may execute a different sequence of functions. Therefore, in this work, we are investigating novel algorithms for clustering, allocation, and mapping of packet-processing applications onto heterogeneous, multiprocessor architectures. Although it is widely known that packet data structures (e.g., packet header, payload, packet meta-data) have little locality, we have shown that application data structures (e.g., per-flow state such as a meter, header compression state, a trie used to organize IP route tables into an efficiently searchable structure) in packet-processing applications exhibit considerable locality of access. Because of the inherent differences in their locality properties, these different types of data structures often interfere with each other and thereby lower the effective hit rate of the memory subsystem. Hence, a single hardware-based mechanism for managing the cache hierarchy is ineffective for packet-processing applications. In our system, the π-compiler will use the access frequency, object size,
and other object and data locality properties collected or derived by the profiler to determine an appropriate memory-hierarchy management policy. This may involve allocating data structures at different levels of the memory hierarchy, distributing data structures across memory banks at the same level of the hierarchy for load balancing [31], and using controlled prefetching. The π -compiler represents an entire packet-processing application as a PPF graph, where each node represents a PPF and each edge represents a communication channel. The inputs to the aggregation and memory mapping functions in the π -compiler are the PPF graph, a high-level representation of code sequences, and the symbol tables. We extend an existing framework of interprocedural analysis to perform a set of analyses across functions to characterize objects, computation, communication, and instruction stores. These analyses provide essential information to each step of aggregate clustering, allocation of various types of resources, placement of aggregates to MEs, and mapping of data structures to memory hierarchy. These decisions not only determine the quality of code produced by the rest of the compiler but could also influence the adaptation performed by RTS. Annotations on how aggregates are placed and replicated as modeled by the π -compiler are passed to RTS to allow efficient mapping, while RTS is free to adapt resource allocation and mapping by observing the system load. Given a set of aggregates, an aggregate construction phase in the π -compiler generates the necessary glue code to tie together the PPFs within an aggregate as well as across aggregates. For instance, since each aggregate executes continuously and processes a stream of packets, the aggregate constructor maps each aggregate to a thread and introduces the code necessary to dispatch packets to the appropriate PPF upon their arrival. 
Similarly, if an aggregate can receive packets from multiple aggregates, the construction phase incorporates the appropriate scheduler to ensure that different types of packets don’t interfere with each other’s performance. The whole compilation infrastructure of Shangri-La incorporates an iterative compilation feature. This provides the system with opportunities to refine the decisions made in an earlier compilation and with a higher chance to approach an optimal solution. In an iterative compilation framework, it is important to identify the type of events and statistics to be monitored and fed back with proper mapping and to design a robust feedback loop to guide subsequent compilations toward a better solution. However, an iterative compilation framework still requires high-quality cost models and heuristics in the compiler to guide the optimizations during each iteration of compilation. An iterative framework with sloppy heuristics may never converge to an optimal solution.
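The balance between pipelining and replication that the π-compiler's cost model has to weigh can be illustrated with a deliberately simple throughput model. Everything here is an assumption made for the example: the `Aggregate` record, the per-packet times (taken as already-estimated inputs that include memory and channel overhead, since the text notes they cannot simply be summed from component PPFs), and the rule that a replicated stage scales linearly.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Aggregate:
    name: str
    time_per_packet: int   # assumed estimate, in arbitrary time units
    replicas: int          # number of identical threads running this stage


def throughput(pipeline):
    """Packets per time unit: the slowest (replicated) stage sets the rate."""
    return min(Fraction(a.replicas, a.time_per_packet) for a in pipeline)


def threads_used(pipeline):
    return sum(a.replicas for a in pipeline)


# Mapping A: four fine-grained stages, one thread each.
fine_grained = [Aggregate(f"stage{i}", 5, 1) for i in range(4)]

# Mapping B: one coarse aggregate replicated on the same four threads.
replicated = [Aggregate("all_ppfs", 14, 4)]
```

With these made-up numbers both mappings use four threads, yet the fine-grained pipeline sustains 1/5 packet per time unit while the replicated aggregate sustains 2/7, because each fine-grained stage pays its own communication overhead. Different estimates would reverse the outcome, which is why the quality of the cost model matters.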
Aggregate compiler

The aggregate compiler receives from the π-compiler a set of aggregates, their mappings to hardware processing cores, as well as a policy for managing the memory hierarchy. The aggregate compiler performs both machine-dependent and machine-independent optimizations with the objectives of maximizing performance and throughput of each aggregate. For each aggregate mapped to a target processing core, the aggregate compiler produces the output in the form of assembly or object code along with a set of annotations used by the RTS. It is common for an NP to contain multiple types of processing cores. The aggregate compiler needs to generate multiple versions of aggregate code in different ISAs for each aggregate that may be mapped to multiple types of processing cores.

Many compiler analysis and optimization techniques developed for general-purpose compilation are applicable to compiling packet-processing applications on NPs. For example, interprocedural analysis performs various types of analysis across functions to provide sharpened analysis results to many subsequent optimizations. Memory optimizations, such as placing data prefetches and reordering data layout or object fields, can hide the latency in memory accesses or improve the spatial locality of accessed data items. Full or partial redundancy elimination can remove redundant computation and memory references appearing on all or some execution paths.

Most of the Shangri-La components introduced thus far are independent of target hardware. However, the code generation (CG) component, where native code is generated and many processor-dependent optimizations are performed, is expected to vary significantly from one processing core to another, since the different types of processing cores on each NP often have dramatically different ISAs and micro-architectural implementations.
On the IXP NPs, the Intel XScale® processor is a general-purpose processing core and has been adopted in the designs of various embedded systems. Hence, we leverage existing technologies and tools to generate code for the aggregate mapped to the Intel XScale® processor. On the other hand, new technologies are being developed to optimize the aggregates on MEs because they have a major impact on the overall throughput of packet-processing applications running on IXP NPs. Furthermore, the design of the MEs, which target the efficient processing of packets, poses a number of challenges to compilation. We discuss several of these challenges next: 1. Fragmented memory hierarchy. On the MEs, the memory hierarchy is divided into a number of levels including local memory (LM), scratchpad, SRAM, and DRAM. LM is local to each ME, whereas the rest are shared by all MEs and the Intel XScale® processor. Unlike a traditional cache structure
managed by the hardware, the address spaces on different levels of the memory hierarchy are distinct, and require different types of instructions and register classes to access different memory levels. No operating system or support of a single virtual address space exists on the MEs. As in the majority of high-level programming languages, procedure invocation is supported in Baker for the sake of programmability and modularity. A call stack is a typical means to support general procedure invocation. However, implementing a stack on fragmented memory hierarchy is nontrivial since the compiler needs to track whether the stack has outgrown the allowed space at a particular memory level. Generating runtime checks to select among multiple code sequences for the different memory levels is inefficient in both performance and code size. We are investigating an interprocedural stack management framework to allow a statically determined mapping from a stack location to a particular memory level. Another issue due to fragmented memory hierarchy is on pointer dereferences. If a pointer may point to different memory levels, there is no easy way to generate efficient instructions to dereference the pointer. One way to address this issue is to force the objects that may be pointed to by the same pointers to be allocated onto the same memory level. A congruence-based points-to analysis may help partition objects into congruence classes to map the object allocation to different levels of the memory hierarchy. 2. Register aggregates and partitioned register classes. The ME ISA allows a number of registers to be used implicitly in an instruction through register aggregates or indexed registers. This adds complexities to several phases in the CG. The IR in CG has to be capable of representing the implicit registers while maintaining both time and space efficiency. 
Instruction scheduling needs to capture the dependencies among all explicit and implicit operands while reordering instructions. Register allocation needs to perform liveness data flow analysis and color those register operands inferred from a compact representation. The registers on an ME are divided into a number of classes. For example, each level of memory hierarchy has its dedicated register classes to move data in and out of its memory. The class of general-purpose registers on an ME is further divided into bank A and bank B. Each instruction often has constraints on the legal combinations of register classes or banks for its operands. This poses a challenge on allocating proper registers to each live variable while minimizing the number of moves among different register classes and banks. 3. Architecture irregularity. In addition to the constraints on allowed register classes and banks for each instruction, there are other irregularities in
the ME ISA. For example, the registers on ME can be accessed using a context-sensitive mode or an absolute mode, where the former treats each register local to a thread and the latter provides a means to access any register across all threads on an ME. Although the absolute mode is an effective way to communicate among different threads on an ME, not all instructions can refer to registers using the absolute mode. Hence, this limits the use of absolute mode. Figure 8.5 shows the block diagram of the ME code generator. The phases and their ordering are similar to other code generators, but the ME CG has to answer the design challenges mentioned earlier. The code selection phase translates from a high-level IR to CG IR, which has a typical one-to-one mapping to ME instructions. The memory optimizations generate efficient code sequences to access memory, for example, to combine multiple loads that access adjacent data fields into one load with a register aggregate target. The Control Flow Optimization (CFO) and Extended Basic-Block Optimization (EBO) phases perform simplification on control flow and classical optimizations on the scope of extended basic blocks, respectively. Loop optimizations, such as loop unrolling, focus on loop structures. The global scheduling may reorder instructions to reduce critical schedule lengths across basic blocks. Register allocation colors all virtual registers with proper register classes and banks. If there are register spills, local instruction scheduling is invoked to reschedule the instructions in the affected basic blocks. The code emission phase emits assembly code as well as the annotations passed to RTS. The system model abstracts architectural and micro-architectural details, such as latency, instruction opcode, the number of MEs, and the size of each level of memory hierarchy, into a separate module. Such information is used throughout the entire compiler. 
The code size guard regularly checks the current usage of the instruction store. When the current code size approaches its physical limit, many optimizations that may increase code size, such as loop unrolling and scheduling with code duplication, are restricted, while optimizations that reduce code size, such as redundancy elimination, are made more aggressive. The heuristics for making intelligent tradeoffs between code size and performance remain an important subject to explore.
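A code size guard of this kind can be sketched as a small policy function. The sketch is illustrative only: the function name `code_size_policy`, the three policy labels, and the 90 percent threshold are assumptions, since the text does not say when code size counts as approaching the limit.

```python
# Optimizations named in the text, grouped by their effect on code size.
GROWS_CODE = ("loop_unrolling", "scheduling_with_code_duplication")
SHRINKS_CODE = ("redundancy_elimination",)


def code_size_policy(used_words, store_words, guard_fraction=0.9):
    """Return a per-optimization policy given instruction-store usage.

    used_words     -- instructions currently emitted for the aggregate
    store_words    -- physical size of the instruction store
    guard_fraction -- assumed threshold at which the guard engages
    """
    near_limit = used_words >= guard_fraction * store_words
    policy = {}
    for opt in GROWS_CODE:
        policy[opt] = "restricted" if near_limit else "normal"
    for opt in SHRINKS_CODE:
        policy[opt] = "aggressive" if near_limit else "normal"
    return policy
```

Because the guard is consulted regularly during code generation, a phase such as loop optimization would call this function with the current usage before deciding whether to apply a size-increasing transformation.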
Figure 8.5. Microengine code generator phases. (Diagram labels: Pi compiler; aggregates represented as WHIRL; ME code generator; lowering and code selection; memory optimization; region formation; global instruction scheduling; loop optimization; register allocation; local instruction scheduling; code emission; assembly files; code size guard; system model.)
8.3.3
Runtime System
Although it may be possible to acquire packet traces that are broadly representative of the workloads that are presented to a packet-processing application, it is likely that these traces will differ from the workloads presented to the application when it is deployed in the field. After all, network applications are deployed in extremely diverse environments. A wireless access point may be deployed in a small business environment with file and printer sharing being the dominant application. The same model of wireless access point could also be
deployed in a residential environment where web surfing may be the dominant application. It would be impossible to come up with a set of packet traces that would accurately represent both of these environments. Even if the packet traces used for profile-driven compilation are accurate with respect to the actual workload, such workloads are rarely static over time. Network traffic characteristics change from hour to hour, minute to minute, and second to second. If a packet-processing application is to keep up with performance demands when confronted with workloads that differ from those used during profiling, it may need to adapt the allocation of resources to software constructs at runtime. Besides performance, other benefits may also come from the ability to adapt. For example, if a network device could dynamically power off or reduce the clock frequency of unused or underused hardware resources, the average power consumption and heat dissipation of the device could be reduced. Also, in an environment where those with malicious intent may try to deny service to legitimate users by hoarding critical resources, the ability to adapt resource allocations at runtime may allow a device to prevent such denials of service. Supporting runtime adaptation requires a runtime system with two important properties: resource-awareness and dynamic resource adaptation. In this context, resource-awareness means that the runtime system must know which resources are being used by the application, and how effectively they are being used. The RTS provides resource-awareness through a resource abstraction layer (RAL) that is linked to the application code at runtime. For each resource type, the RAL defines an interface and includes one or more implementations of the interface. For instance, the RAL supports a queue resource type with enqueue and dequeue methods. 
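The idea of one interface with several interchangeable implementations, bound at runtime, can be sketched as follows. This is a minimal illustration only: the class names (`QueueResource`, `BoundedQueue`, `UnboundedQueue`, `QueueHandle`) and the `rebind` method are hypothetical and do not come from the RTS itself.

```python
from abc import ABC, abstractmethod
from collections import deque


class QueueResource(ABC):
    """Interface for the queue resource type (enqueue and dequeue)."""

    @abstractmethod
    def enqueue(self, item) -> bool:
        """Return True if the item was accepted."""

    @abstractmethod
    def dequeue(self):
        """Return the oldest item, or None if the queue is empty."""


class BoundedQueue(QueueResource):
    """Stand-in for a small, fixed-capacity queue implementation."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._items = deque()

    def enqueue(self, item) -> bool:
        if len(self._items) >= self._capacity:
            return False
        self._items.append(item)
        return True

    def dequeue(self):
        return self._items.popleft() if self._items else None


class UnboundedQueue(QueueResource):
    """Stand-in for a larger queue implementation."""

    def __init__(self) -> None:
        self._items = deque()

    def enqueue(self, item) -> bool:
        self._items.append(item)
        return True

    def dequeue(self):
        return self._items.popleft() if self._items else None


class QueueHandle:
    """What application code holds; the bound implementation can change."""

    def __init__(self, impl: QueueResource) -> None:
        self._impl = impl

    def enqueue(self, item) -> bool:
        return self._impl.enqueue(item)

    def dequeue(self):
        return self._impl.dequeue()

    def rebind(self, new_impl: QueueResource) -> None:
        # Move pending items across so that ordering is preserved.
        # (None is reserved as the "empty" marker and cannot be queued.)
        item = self._impl.dequeue()
        while item is not None:
            new_impl.enqueue(item)
            item = self._impl.dequeue()
        self._impl = new_impl
```

Because callers only see `QueueHandle`, the implementation behind it can be swapped without touching or recompiling the calling code, which is the property the RAL relies on.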
A RAL implementation may support multiple queue implementations (e.g., on the IXP2800 network processor, queues can be implemented using next-neighbor registers, on-chip scratch memory, or off-chip SRAM or DRAM). This decomposition allows the runtime system to select the most suitable implementation based on the mapping of aggregates onto processors (e.g., for aggregates mapped to neighboring microengines, the runtime system selects the queue implementation that uses next-neighbor registers; in other cases, it selects the scratch memory or SRAM/DRAM implementations). Since the RAL interfaces are linked at runtime, the resource allocations can be modified at runtime without recompiling the code. Designing an abstraction layer for a network device is a challenge for two reasons. First, although the packet-processing domain is a narrower domain than general computing, different applications found in this domain still require quite varied hardware services. For example, an Ethernet switch may require the ability to compute hashes very efficiently, whereas a VPN offload device may require
the ability to perform encryption and decryption very efficiently. Second, the spectrum of hardware platforms used in these applications is also broad, many requiring different methods of performing the same computational tasks. To support dynamic resource adaptation, the runtime system monitors system performance and traffic conditions and adapts resource allocations across aggregates. The system monitor allows users or higher layers of the system to define triggers based on predicates defined over runtime measures. At runtime, the monitor receives system statistics (e.g., queue lengths) from the resources, either through a polling interface or through asynchronous event notifications, evaluates the predicates, and generates events if any of the predicates are satisfied. Based on the performance requirement of the application and the current traffic conditions, the resource allocator determines and enforces the new allocation. The design and implementation of such a distributed monitoring and resource adaptation framework is challenging for two reasons. First, the performance-sensitive nature of packet-processing applications means the monitoring infrastructure as well as the runtime system should impose as little overhead as possible. Second, the inherent heterogeneity and widely varying capabilities of the resources available in NP-based systems make the task of determining an optimal mapping of pipeline stages to processor resources complex. When a packet-processing pipeline is mapped onto multiple processors, a packet migrates from one processor to another, with each processor providing a portion of the total service requested by the packet. Further, each processor may simultaneously service packets belonging to multiple flows. Providing performance guarantees in such distributed, shared environments requires sophisticated techniques for coordination and scheduling of multiple resources [32, 33]. 
These techniques determine (1) the mapping of flows to resource instances, in the event that multiple resources with the same functional capability but with different performance characteristics are available in the packet-processing system; (2) the relative priority for processing packets from different flows at each resource; and (3) guidelines for gracefully degrading system performance (or at least the performance observed by some flows) in the presence of persistent overload. Such a resource allocation framework is essential to construct packet-processing systems that are robust to denial-of-service attacks. In addition to supporting runtime adaptation, the RTS supports features necessary for running and debugging code on an embedded network device. These features include the ability to load code, run code, and debug code. The design decomposition of the RTS is shown in Figure 8.6.
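The trigger mechanism of the system monitor described above can be sketched in a few lines of Python. Everything here is hypothetical (the `SystemMonitor` class, the statistic names `rx_queue_len` and `drop_rate`, and the thresholds); the sketch shows only the polling variant, in which a caller hands the monitor a snapshot of statistics and receives the names of the triggers that fired.

```python
from typing import Callable, Dict, List, Tuple

Stats = Dict[str, float]
Predicate = Callable[[Stats], bool]


class SystemMonitor:
    """Evaluates user-defined predicates over runtime measures."""

    def __init__(self) -> None:
        self._triggers: List[Tuple[str, Predicate]] = []

    def add_trigger(self, name: str, predicate: Predicate) -> None:
        self._triggers.append((name, predicate))

    def evaluate(self, stats: Stats) -> List[str]:
        """Return the names of all triggers whose predicate holds for stats."""
        return [name for name, predicate in self._triggers if predicate(stats)]


# Hypothetical triggers over assumed statistic names and thresholds.
monitor = SystemMonitor()
monitor.add_trigger("rx_queue_backlog", lambda s: s["rx_queue_len"] > 100)
monitor.add_trigger("drops_seen", lambda s: s["drop_rate"] > 0.01)

events = monitor.evaluate({"rx_queue_len": 150.0, "drop_rate": 0.0})
```

In a real system the resulting events would be handed to the resource allocator; here they are simply returned as a list.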
Figure 8.6. Runtime system decomposition. (Diagram labels: software developer; loadable binaries; run-time system; event notification service; RAL interface linker; developer services; resource allocator; system monitor; resource abstraction layer (RAL); hardware.)
The literature contains operating system designs for multiprocessor systems [34–36], real-time systems [37–39], extensible systems [40–42, 37], and pipelined systems [9, 19, 43]. We plan to leverage many concepts from this prior work. The realization of these concepts in packet-processing systems with stringent resource and timeliness constraints poses several problems that we plan to explore.
8.4
CONCLUSIONS
The programming environments (languages, compilers, and runtime systems) for NPs are in their infancy. At the same time, NPs represent a much larger trend in the processor industry: multicore, lightweight threaded architectures designed for throughput-driven applications. Once this trend reaches the mainstream programming marketplace, programmers everywhere will need a programming environment that is as easy to use as the programming environments for today’s workstations and servers. The Shangri-La architecture represents a complete programming environment for the domain of packet processing on multicore, lightweight threaded architectures in general, and NPs specifically. Shangri-La encompasses: (1) a language that exposes domain constructs instead of hardware constructs, keeping the programmer and code separate from architectural details; (2) a sophisticated compiler complex that uses profile information to guide the mapping of code to processors and data structures to memory automatically; and (3) a runtime system to ensure maximum performance benefits in the face of fluctuating traffic conditions, both natural and malicious. We are currently working on two major tasks: creating a prototype implementation of the proposed architecture, and researching the more difficult questions that we will face as development proceeds. The prototype system builds on the Open Research Compiler infrastructure and targets the Intel IXP2400 network processor. It includes implementations of each component shown in Figure 8.1, with simplified algorithms in some of the components. This prototype system will provide a platform for further research and development. Our current research tasks cover a wide spectrum, and our progress includes published work in language design [44] and runtime adaptation of resource allocations [45].
REFERENCES
[1]
AMCC’s nP7xxx series of Network Processors, www.mmcnetworks.com/solutions/.
[2]
Agere’s PayloadPlus Family of Network Processors, www.agere.com/enterprise_metro_access/network_processors.html.
[3]
IBM PowerNP Network Processors, www-3.ibm.com/chips/techlib/techlib.nsf/products/IBM_PowerNP_NP4GS3.
[4]
iFlow Family of Processors, Silicon Access, www.siliconaccess.com.
[5]
Intel IXP family of Network Processors, www.intel.com/design/network/products/npfamily/index.htm.
[6]
The Motorola CPort family of Network Processors, www.motorola.com/webapp/sps/site/taxonomy.jsp?nodeId=01M994862703126.
[7]
TejaNP*: A Software Platform for Network Processors, www.teja.com.
[8]
L. George and M. Blume, “Taming the IXP network processor,” Proceedings of PLDI ’03, San Diego, California, pp. 26–37, 2003.
[9]
E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, “The Click modular router,” ACM Transactions on Computer Systems, 18 (3), pp. 263–297, 2000.
[10]
M. E. Kounavis, A. T. Campbell, S. Chou, F. Modoux, J. Vicente, and H. Zhang, “The Genesis Kernel: A programming system for spawning network architectures,” IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Active and Programmable Networks, 19 (3), pp. 49–73, March, 2001.
[11]
S. Silva, Y. Yemini, and D. Florissi, “The NetScript active packet processing system,” IEEE Journal on Selected Areas in Communications (JSAC), 19 (3), pp. 538–551, March 2001.
[12]
A. T. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente, “NetBind: A binding tool for constructing data paths in network processor based routers,” Proceedings of Fifth International Conference on Open Architectures and Network Programming (OPENARCH ’02), New York, pp. 91–103, June 2002.
[13]
S. Karlin and L. Peterson, “VERA: An extensible router architecture,” Computer Networks, 38 (3), pp. 277–293, 2002.
[14]
D. Mosberger, “Scout: A Path-based Operating System,” Ph.D. Dissertation, Department of Computer Science, University of Arizona, July 1997.
[15]
D. Decasper, Z. Dittia, G. Parulkar, and B. Plattner, “Router plugins: A software architecture for next generation routers,” Proceedings of SIGCOMM ’98, pp. 229–240, 1998.
[16]
F. Kuhns, J. DeHart, A. Kantawala, R. Keller, J. Lockwood, P. Pappu, D. Richards, D. Taylor, J. Parwatikar, E. Spitznagel, J. Turner, and K. Wong, “Design of a high performance dynamically extensible router,” Proceedings of DARPA Active Networks Conference and Exposition ’02, pp. 42–64, 2002.
[17]
R. Keller, L. Ruf, A. Guindehi, and B. Plattner, “PromethOS: A dynamically extensible router architecture supporting explicit routing,” Proceedings of Fourth Annual International Working Conference on Active Networks, pp. 20–31, 2002.
[18]
Y. Gottlieb and L. Peterson, “A comparative study of extensible routers,” Proceedings of Open Architectures and Network Programming ’02, pp. 51–62, 2002.
[19]
M. Philippsen, “A survey of concurrent object-oriented languages,” Concurrency: Practice and Experience, vol. 12, pp. 917–980, 2000.
[20]
Agere Functional Programming Language, www.agere.com/enterprise_metro_access/docs/PB02014.pdf.
[21]
S. Bokhari, Assignment Problems in Parallel and Distributed Computing, Kluwer Academic Publishers, 1987.
[22]
R. Gupta, S. Pande, K. Psarris, and V. Sarkar, “Compilation techniques for parallel systems,” Parallel Computing, 25 (13–14), pp. 1741–1783, 1999.
[23]
S. Orlando and R. Perego, “Scheduling data-parallel computations on heterogeneous and time-shared environments,” Proceedings of European Conference on Parallel Processing, pp. 356–366, 1998.
[24]
V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, The MIT Press, 1989.
[25]
K. Sevcik, “Characterizations of parallelism and their use in scheduling,” Proceedings of 1989 ACM SIGMETRICS Conference, pp. 171–180, 1989.
[26]
R. Subrahmanian, I. D. Scherson, V. L. M. Reis, and L. M. Campos, “Scheduling computationally intensive data parallel programs,” Proceedings of Placement Dynamique et Repartition de Charge: Application aux Systemes Paralleles et Repartis, Paris, France, pp. 39–60, July 1996.
[27]
T. Yang and A. Gerasoulis, “PYRROS: Static task scheduling and code generation for message passing multiprocessors,” Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, pp. 428–437, 1992.
[28]
M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe, “A stream compiler for communication-exposed architectures,” Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, pp. 291–303, October 2002.
[29]
J. Subhlok and G. Vondran, “Optimal mapping of sequence of data parallel tasks,” Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, California, pp. 134–143, July 1995.
[30]
J. Subhlok, J. M. Stichnoth, D. R. O’Hallaron, and T. Gross, “Exploiting task and data parallelism on a multicomputer,” 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, pp. 13–22, May 1993.
[31]
R. Barua, “Maps: A Compiler-Managed Memory System for Software Exposed Architectures,” Ph.D. dissertation, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 2000.
[32]
A. Chandra, M. Adler, P. Goyal, and P. Shenoy, “Surplus fair scheduling: A proportional-share CPU scheduling algorithm for symmetric multiprocessors,” Proceedings of 4th Symposium on Operating System Design and Implementation (OSDI) 2000, pp. 45–58.
[33]
A. Srinivasan and J. Anderson, “Efficient scheduling of soft real-time applications on multiprocessors,” Proceedings of 15th Euromicro Conference on Real-Time Systems, IEEE Computer Society Press, pp. 51–59, July 2003.
[34]
G. C. Hunt and M. L. Scott, “The Coign distributed partitioning system,” Proceedings of 3rd Symposium on Operating Systems Design and Implementation, pp. 187–200, February 1999.
[35]
W. Shu, “Chare kernel: A runtime support system for parallel computations,” Journal of Parallel and Distributed Computing, 11 (3), pp. 198–211, 1991.
[36]
M. Young, A. Tevanian, R. Rashid, D. Golub, J. Eppinger, J. Chew, W. Bolosky, D. Black, and R. Baron, “The duality of memory and communication in the implementation of a multiprocessor operating system,” Proceedings of 11th Symposium on Operating Systems Principles, pp. 63–76, November 1987.
[37]
W. M. Gentleman, S. A. MacKay, D. A. Stewart, and M. Wein, “An introduction to the harmony realtime operating system,” Newsletter of the IEEE Computer Society Technical Committee on Operating Systems, pp. 3–6, Summer 1988.
[38]
K. G. Shin, D. D. Kandlur, D. L. Kiskis, P. S. Dodd, H. A. Rosenberg and A. Indiresan, “A distributed real-time operating system,” IEEE Software, pp. 56–68, September 1992.
[39]
D. B. Stewart, D. E. Schmitz, and P. K. Khosla, “The Chimera II real-time operating system for advanced sensor-based control applications,” IEEE Transactions on Systems, Man and Cybernetics, 22 (6), pp. 1282–1295, November–December 1992.
[40]
B. Bershad, S. Savage, P. Pardyak, E. G. Sirer, D. Becker, M. Fiuczynski, C. Chambers, and S. Eggers, “Extensibility, safety and performance in the SPIN operating system,” Proceedings of ACM Symposium on Operating Systems Principles, pp. 267–283, December 1995.
[41]
G. Coulson and G. S. Blair, “Architectural principles and techniques for distributed multimedia application support in operating systems,” ACM SIGOPS Operating Systems Review, 29 (4), pp. 17–24, October 1995.
[42]
D. R. Engler and M. F. Kaashoek, “Exokernel: An operating system architecture for application-level resource management,” Proceedings of ACM Symposium on Operating Systems Principles, pp. 252–266, December 1995.
[43]
M. Welsh, D. Culler, and E. Brewer, “SEDA: An architecture for well-conditioned, scalable Internet services,” Proceedings of Symposium on Operating Systems Principles (SOSP-18), pp. 230–243, October 2001.
[44]
S. Goglin, D. Hooper, A. Kumar, and R. Yavatkar, “Advanced software framework, tools, and languages for the IXP family,” Intel Technology Journal, developer.intel.com/technology/itj/2003/volume07issue04/, November 2003.
[45]
R. Kokku, T. Riche, A. Kunze, J. Mudigonda, J. Jason, and H. Vin, “A case for run-time adaptation in packet processing systems,” ACM SIGCOMM Computer Communication Review, 34 (1), pp. 107–112, January 2004.
CHAPTER 9
RNOS—A Middleware Platform for Low-Cost Packet-Processing Devices
Jonas Greutert, NetModule AG, Switzerland
Lothar Thiele, Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland
Much research has focused, and continues to focus, on high-performance core and edge network devices. The main objectives are processing power, that is, the number of packets per second that can be processed, and the extensibility of those systems. The issues addressed include optimal design, new architectures, better algorithms, and implementation methods and tools; see, for example, Refs. [1–3]. There is another class of devices with different objectives. Small embedded devices that have simple architectures and low performance are deployed in large numbers. These are gateways of any type and small sensor/actuator devices with network attachment, providing a variety of services. Although these devices do not have a high capacity to process packets, they often place stringent demands on the real-time processing of certain flows. Meeting deadlines is more important than the actual number of packets that can be processed. Just being faster would not help in most cases. Typically, these devices are low cost and are built around a standard communication controller, that is, they do not contain a highly specialized network processor. Packet processing is usually only part of the complete application that is running on these devices, although a critical part with respect to predictability.
Significant work has been done in developing architectures for software-based routers. In Click [4] applications are composed of elements, which is a natural way to design networking applications. Click, however, lacks the concept of flows and does not provide any mechanism to schedule the available resources. The Scout OS [5] has explicit paths to improve resource allocation
and scheduling. It is a soft real-time system that provides admission control with respect to CPU load and memory, but it does not provide mechanisms to calculate backlog and delay of individual flows. A concept that provides QoS to certain flows in a software-based router while optimizing the throughput for best effort traffic has been described in Ref. [6]. Although the concept is based on scheduling the CPU resource, it does not provide any real-time guarantees. An Estimation-based Fair Queuing (EFQ) algorithm that is used to schedule processing resources has been described in Ref. [7]. It also contains a concept for online estimation of processing times and an admission control. Again, no real-time guarantees can be provided. In summary, there are solutions available for various issues in software-based routers. What is missing is a unifying approach for small embedded devices that enables both a formal analysis and an implementation that matches the predicted behavior.
This chapter describes RNOS (Real-time Network Operating System), a middleware platform for low-cost packet-processing devices with real-time requirements. RNOS consists of an analysis model and an implementation model. The analysis model allows the exploration of packet-processing applications with real-time requirements for different input and resource scenarios. The implementation model allows a seamless implementation of the analysis model on a single-CPU system and guarantees the real-time behavior predicted by the formal analysis of the model. As a result of the matching analysis and implementation models, we obtain a platform for the design of predictable small embedded devices.
The remainder of the chapter is organized as follows: Section 2 describes the scenario we use throughout the following sections to illustrate the applicability of the approach. Section 3 presents the analysis model and Section 4 describes the implementation model of the software platform. In Section 5 we report measurement results and compare them with analysis results. Finally, we summarize and comment on future work in Section 6.
9.1
SCENARIO
A single scenario is used to illustrate the concepts presented in this chapter. An existing home access router shall be extended with Voice over IP (VoIP) functionality. The hardware shall remain the same, the only exception being an additional module that will be plugged into an existing extension port. That additional module, called the VoIP module, has the necessary DSP resources to code/decode the voice packets and detect/play DTMF tones, and it has all the physical interfaces required to connect traditional phones or a PBX. Figure 9.1 shows the block diagram of the hardware used in this scenario. As engineers we are faced with the following questions: How many voice channels are possible? Will the VoIP feature degrade existing functionality, and by how much?
The hardware of the home access router is based on a commercial communication processor. It has two Ethernet interfaces and an extension port for the VoIP functionality. The extension port is a 16-bit host port interface that connects seamlessly to the DSP. An external SDRAM provides the necessary memory for packets, data, and the application. Independent DMA controllers and the CPU arbitrate for the memory. Dedicated hardware units are responsible for reception and transmission of Ethernet packets. The communication controller runs with a clock frequency of 50 MHz. The CPU has 4-Kbyte instruction and data cache. The total bill of material is less than US$ 100 for low volumes, including the VoIP module.
Figure 9.1. Block diagram of the low-cost embedded system used in the scenario. (Diagram labels: Ethernet PHY (two); MII; communication controller; host port; bus; SDRAM; SRAM; Flash; VoIP module; DSP; PCM; voice interface.)
9.2
ANALYSIS MODEL OF RNOS
The RNOS platform consists of two closely related parts: an analysis part and an implementation part. The purpose of the analysis part is to model the whole system in terms of application, input scenarios, and resource usage and to provide a methodology that enables the analysis of relevant system properties. The implementation part, which will be discussed in Section 4, provides a middleware infrastructure and a programming interface to implement those systems. It is constructed in a way that matches the formal models in the analysis part such that the predicted system behavior is obtained. The analysis part of RNOS consists of formal models for the application, the input scenarios in terms of packet streams, and the available hardware and software resources, together with a calculus that allows determining the throughput and delay experienced by the input streams. The basic model is taken from Ref. [8] and extended toward the networking domain and small embedded devices.
9.2.1
Application Model
The application model defines the required functionality of the system. To be useful, the application model has to be easy to use and has to be able to capture the domain-specific functionality [9]. One way of modeling the whole application is based on a partitioning into small processing units that will be denoted as tasks. In addition, the application model will be based on the notion of events, for example, a packet has been received or a timer elapsed. The combination of these concepts leads to the natural model that each event in the system has its own “program” that is triggered for execution when the event occurs and that consists of tasks.
Tasks
Tasks are nonpreemptive execution blocks of code. Packets are received and delivered through, at most, one input and an arbitrary number of outputs, respectively. When a task executes, it takes the packet from the input, processes it, and depending on the content of the packet, puts the packet on one of its outputs. A source task has no inputs and will receive its packets from a driver, for example, from an Ethernet driver. A sink task has no outputs and will either consume the packet, pass it to a driver for transmission, for example, to the Ethernet driver, or pass it to the user mode, for example, to the socket interface. The worst case and best case execution times of each task need to be known, either by formal analysis or by simulation in case of soft real-time constraints.
Task trees
A whole application contains a set of task trees, which consist of connected tasks. In particular, each output of a task is connected to one input of another task and for each packet source there is a task tree with a source task at its root. Therefore, an application consists of as many task trees as there are packet sources and each task tree has exactly one source. A packet will traverse the task tree from the source to a sink. The source task will receive a packet, for example, from Ethernet, process it, and pass it to one of its outputs depending on the content of the packet, where it is processed by the subsequent task. A packet will follow exactly one path when it traverses the task tree. Figure 9.2 shows a simplified task tree for packet reception in an IP router. The task tree is simplified in that it shows the paths for IP packets only. The complete application consists of a number of such trees, one for each packet source.
Figure 9.2. Simplified task tree for packet reception that contains three paths. (Diagram labels: root/source task; sink task; Eth-mac-rx; Ip-header-check; Nat-rx; Classifier-in; Acl-in; Ip-forwarder; Ip-reassembly; Ip-rx; Udp; Tcp; Socket (two); Classifier-out; Acl-out; Nat-tx; Ip-fragmentation; Eth-mac-tx.)
Example 9.1. In our scenario, we add the processing for VoIP traffic to the standard IP router functionality. Voice is especially sensitive to delay and delay jitter. It has been shown that a significant source of delay is the end-system itself [10]. Therefore, we add an optimized path for VoIP data packets to our task graph. Using the standard path for IP/UDP processing and passing the packet to the socket interface would be unnecessary overhead. Figure 9.3 shows the extended task graph for packet reception with VoIP. The voip-data-mux task has a filter that is optimized to filter VoIP data traffic. If the filter matches, the packet is a VoIP data packet and is passed to a combined and highly optimized IP/UDP receive task. The RTP (Real-time Transport Protocol) receive task follows, and finally the packet is passed to the DSP for decoding. The reverse direction is similar (not shown): DSP, RTP, UDP/IP, and finally Ethernet transmit.
Figure 9.3. Extended task tree for packet reception with VoIP. (Diagram labels as in Figure 9.2, with the additional tasks Voip-data-mux; Voip-ip-udp; Rtp; To-dsp.)
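A minimal Python sketch can make the traversal rule concrete: each task picks exactly one output per packet, so a packet follows exactly one path from the source to a sink. The `Task` class, the `traverse` helper, the dictionary packet format, and the port range in `is_voip_data` are all hypothetical. The placement of voip-data-mux directly after ip-header-check is an assumption read from the task names, and nat-rx stands in here for the remainder of the standard IP path.

```python
from typing import Callable, List, Optional, Sequence


class Task:
    """A nonpreemptive processing step with any number of outputs."""

    def __init__(self, name: str, outputs: Sequence["Task"] = (),
                 select: Optional[Callable[[dict], int]] = None) -> None:
        self.name = name
        self.outputs = list(outputs)
        # select maps a packet to the index of the output it leaves on;
        # by default a task forwards everything to its first output.
        self.select = select or (lambda packet: 0)

    def process(self, packet: dict) -> Optional["Task"]:
        """Return the next task for this packet, or None at a sink."""
        if not self.outputs:
            return None
        return self.outputs[self.select(packet)]


def traverse(source: Task, packet: dict) -> List[str]:
    """Follow one packet from the source task to a sink; return the path."""
    path: List[str] = []
    task: Optional[Task] = source
    while task is not None:
        path.append(task.name)
        task = task.process(packet)
    return path


def is_voip_data(packet: dict) -> bool:
    # Hypothetical filter: UDP traffic to an assumed RTP port range.
    return packet.get("proto") == "udp" and 16384 <= packet.get("dst_port", 0) < 16484


# Build the tree from the sinks back to the source.
to_dsp = Task("to-dsp")
rtp = Task("rtp", [to_dsp])
voip_ip_udp = Task("voip-ip-udp", [rtp])
nat_rx = Task("nat-rx")  # stand-in for the rest of the standard IP path
voip_data_mux = Task("voip-data-mux", [voip_ip_udp, nat_rx],
                     select=lambda p: 0 if is_voip_data(p) else 1)
ip_header_check = Task("ip-header-check", [voip_data_mux])
eth_mac_rx = Task("eth-mac-rx", [ip_header_check])
```

Calling `traverse(eth_mac_rx, {"proto": "udp", "dst_port": 16400})` yields the optimized VoIP path ending at to-dsp, while a TCP packet leaves voip-data-mux on its second output.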
9.2.2
Input Model—SLA, Flows, and Microflows
The input model has to capture packet flows that are specified using service level agreements (SLA) and packet flows for which we do not know exactly (or know nothing about) how they will arrive at the system inputs. A service level agreement specifies end-to-end quality of service properties for a flow. A flow is identified by a set of common properties derived from data in the packet. Typically these are the incoming interface, ranges of source and destination IP addresses, transport protocol, and ports or port ranges. A service level agreement usually contains parameters such as minimum bandwidth, maximum delay, loss probability, and maximum jitter. Another form of an SLA is the T-Spec model of the IETF [11]. Each packet that is received by our application belongs to a flow. There is a service agreement for each flow, which specifies the end-to-end properties of that flow and therefore specifies how we should treat the packet inside our application, for example, with what priority the packet should traverse the task graph. All the packets that do not match a flow are associated with the best effort flow. The best effort flow has no end-to-end quality of service properties and therefore has lowest priority. Packets of the same flow might not traverse the same path in the task tree. A flow specification might cover a larger set of connections, a connection being defined as having the same incoming and outgoing interface, same source and destination IP address, the same transport protocol, and the same source and destination port. In our model such a connection is called a microflow. Packets that belong to the same microflow will traverse exactly the same path through the task tree. Figure 9.4 shows a flow that consists of two microflows; each of them traverses a different task path. Service level agreements can easily be formalized as arrival curves and deadlines [8, 12].
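The distinction between flows and microflows can be sketched as a small classifier. This is an illustration under assumptions: the `Packet` and `FlowSpec` types, the field names, and the example port range are hypothetical. A flow specification matches on ranges, unmatched packets fall into the best-effort flow, and a microflow is keyed by the full connection tuple.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List


@dataclass(frozen=True)
class Packet:
    in_if: str
    out_if: str
    src_ip: str
    dst_ip: str
    proto: str
    src_port: int
    dst_port: int


@dataclass(frozen=True)
class FlowSpec:
    """A flow: incoming interface, address ranges, protocol, port range."""
    name: str
    in_if: str
    src_net: str
    dst_net: str
    proto: str
    dst_ports: range

    def matches(self, p: Packet) -> bool:
        return (p.in_if == self.in_if
                and ip_address(p.src_ip) in ip_network(self.src_net)
                and ip_address(p.dst_ip) in ip_network(self.dst_net)
                and p.proto == self.proto
                and p.dst_port in self.dst_ports)


def classify(p: Packet, specs: List[FlowSpec]) -> str:
    """Return the first matching flow name, else the best-effort flow."""
    for spec in specs:
        if spec.matches(p):
            return spec.name
    return "best-effort"


def microflow_key(p: Packet) -> tuple:
    """A microflow is one connection: the complete tuple must be equal."""
    return (p.in_if, p.out_if, p.src_ip, p.dst_ip,
            p.proto, p.src_port, p.dst_port)
```

Two packets can thus belong to the same flow while having different microflow keys, which is why packets of one flow need not share a task path.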
Figure 9.4. Flow and microflow. (Diagram labels: task paths; flow as specified by SLA; microflow 1; microflow 2.)
Definition 9.1. $\alpha^{l}(\Delta)$ is the minimum number of packets that arrive in any time interval of length $\Delta$. Similarly, $\alpha^{u}(\Delta)$ is the maximum number of packets that arrive in any time interval of length $\Delta$.
The upper and lower arrival curves specify the bounds for the arrival of traffic. Therefore, we can model the uncertainties of the arrival of packets. The deadline is an internal deadline for the execution of packets of that flow. It is specified depending on the priority of a flow. In summary, a flow consists of one or more microflows and its quality of service requirements are specified using arrival curves and a deadline. All packets of a microflow will traverse the same task path, in order of arrival.
Example 9.2. In our scenario, we have a flow for the VoIP data receive and transmit traffic, a flow for the VoIP control receive and transmit traffic, and a best effort flow for all other input packets. The SLA for the VoIP data receive traffic is very stringent if a quality comparable to traditional PSTN quality is to be reached [13]. The SLA for VoIP-data (unidirectional) could be specified as follows:
Parameter            Value
Minimum bandwidth    108 kbit/s
Maximum delay        160 ms
Loss probability     <1%
Maximum jitter       120 ms
The minimum bandwidth can be calculated by adding the protocol headers to the payload and multiplying the result by the number of packets per second. For G.711 [19] with a packet period of 10 ms this gives 138 bytes per packet, which is about 108 Kbit/s per direction (14 to 18 bytes Ethernet header, 20 bytes IP header, 8 bytes UDP header, 12 bytes RTP header, 80 bytes payload, and 4 bytes Ethernet CRC). From the given SLA we determine an arrival curve and a deadline. The arrival curve gives the lower and upper bounds for the number of packets to be processed in any time window. The bandwidth itself is not considered here, because all the tasks in the application have a per-packet execution time only. The SLA defines a maximum jitter of 120 ms, which means that we could receive a burst of 12 VoIP data packets at line rate. The line rate in our scenario is 10 Mbit/s. A packet of 138 bytes requires about 106 µs on the line. Figure 9.5 shows the arrival function for the VoIP data receive traffic. The total delay a user will experience consists of the network delay plus the delay introduced by the end systems. As we want to minimize the delay of VoIP data traffic introduced by our system, we set the deadline to the execution time for the task path on the target system. Therefore, the flow will have maximum priority.
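The arithmetic of Example 9.2 can be retraced with a few lines of Python. This is only a sketch: the constant and function names are illustrative, the 14-byte Ethernet header is the smaller of the two sizes quoted in the text, and the line rate is assumed to be 10 Mbit/s with a decimal prefix. Under these assumptions the bandwidth comes out at 110,400 bit/s, which is roughly 108 Kbit/s if K is read as 1024.

```python
# Sketch of the Example 9.2 arithmetic. All names are illustrative.
HEADER_BYTES = {"ethernet": 14, "ip": 20, "udp": 8, "rtp": 12, "ethernet_crc": 4}
PAYLOAD_BYTES = 80                 # G.711 payload per 10 ms packet
PACKET_PERIOD_MS = 10
MAX_JITTER_MS = 120
LINE_RATE_BIT_PER_S = 10_000_000   # assumption: 10 Mbit/s, decimal prefix


def packet_size_bytes():
    """Headers plus payload of one VoIP data packet."""
    return sum(HEADER_BYTES.values()) + PAYLOAD_BYTES


def bandwidth_bit_per_s():
    """Packet size in bits times the number of packets per second."""
    packets_per_second = 1000 // PACKET_PERIOD_MS
    return packet_size_bytes() * 8 * packets_per_second


def worst_case_burst_packets():
    """Packets held back by up to the maximum jitter may arrive back to back."""
    return MAX_JITTER_MS // PACKET_PERIOD_MS


def wire_time_us():
    """Time one packet occupies the line, in microseconds."""
    return packet_size_bytes() * 8 * 1_000_000 / LINE_RATE_BIT_PER_S


if __name__ == "__main__":
    print(packet_size_bytes(), "bytes per packet")
    print(bandwidth_bit_per_s(), "bit/s per direction")
    print(worst_case_burst_packets(), "packets in the worst-case burst")
    print(wire_time_us(), "microseconds per packet on the line")
```

Changing `LINE_RATE_BIT_PER_S` or the header sizes shows how sensitive the per-packet wire time is to the prefix convention and to the Ethernet header variant.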
Figure 9.5: Upper and lower arrival curves ($\alpha^u$, $\alpha^l$) for a VoIP data traffic receive flow (number of packets versus time window in µs).
The VoIP data transmit traffic is a constant-packet-rate flow. The data packets are generated at a constant rate by the DSP. Usually, the period is 10 ms for G.711 and longer for other coders. VoIP control traffic occurs mostly at call setup and teardown. During the calls, not much information is exchanged between the endpoints or between an endpoint and the gatekeeper. The amount also depends on the actual protocol and which variant thereof is used; see also Refs. [14–18]. The best-effort flow has an upper arrival curve that is equal to the line rate. The deadline of the best-effort flow is infinite and therefore it has the lowest priority.
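Definition 9.1 can be made concrete by measuring a packet trace. The sketch below is a brute-force illustration, not part of RNOS: it slides a window of fixed length over integer time ticks and reports the smallest and largest packet count seen. The function name, the half-open window convention, and the two sample traces (timestamps in ms) are assumptions made for the example.

```python
from bisect import bisect_left


def empirical_arrival_curves(timestamps, window, horizon):
    """Return (minimum, maximum) packet count over every window
    [start, start + window) with integer start and
    0 <= start <= horizon - window."""
    ordered = sorted(timestamps)
    counts = []
    for start in range(0, horizon - window + 1):
        first = bisect_left(ordered, start)
        last = bisect_left(ordered, start + window)
        counts.append(last - first)
    return min(counts), max(counts)


# A constant-rate flow: one packet every 10 ms for 100 ms.
periodic = list(range(0, 100, 10))
# The same ten packets arriving back to back, one per ms.
bursty = list(range(0, 10))

if __name__ == "__main__":
    print(empirical_arrival_curves(periodic, window=25, horizon=100))
    print(empirical_arrival_curves(bursty, window=25, horizon=100))
```

For a 25 ms window the periodic trace stays between 2 and 3 packets, while the bursty trace ranges from 0 to 10, which is the kind of spread that the gap between the lower and upper arrival curves captures.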
9.2.3
Resource Model The resource model for the target systems, that is, small, low-cost, single-CPU systems, is simple. We model the single CPU resource using a service curve as proposed in Ref. [8].
Definition 9.2. $\beta^l(\Delta)$ is the minimum available CPU time in any time interval of length $\Delta$. Similarly, $\beta^u(\Delta)$ is the maximum available CPU time in any time interval of length $\Delta$.
Therefore, upper and lower service curves specify the bounds for the availability of the CPU resource. They allow us to work with uncertain availability of the resource. We will denote the available CPU time as CPU capacity in the following example.
Example 9.3. In the scenario, the CPU of the communication controller runs a standard real-time operating system. The real-time operating system will provide 6 ms of each 10-ms interval to the execution of the tasks. The other 4 ms will be used to run other (user-level) applications. However, if no other applications require the CPU, it is available to the packet processing. Figure 9.6 shows the upper and lower service curves for the CPU resource as it is available for the execution of tasks. The CPU capacity divided by the time interval is the fraction of the total CPU resource that is available.
Figure 9.6: Upper and lower service curves ($\beta^u$, $\beta^l$) for a CPU resource (CPU capacity versus time window in µs).
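One possible reading of Example 9.3 in code is sketched below. It assumes, beyond what the text states, that the RTOS grants the 6 ms as a fixed slot at the same position in every 10 ms period and that the worst-case window starts right when the slot ends. The upper curve assumes that no other application claims the CPU, so all of the time is available. Function and constant names are illustrative.

```python
PERIOD_US = 10_000   # length of the RTOS scheduling interval
BUDGET_US = 6_000    # CPU time guaranteed to packet processing per interval


def beta_lower(delta_us):
    """Guaranteed CPU time in any window of length delta_us, assuming a
    fixed slot per period and a window that starts just after the slot."""
    full_periods, rest = divmod(delta_us, PERIOD_US)
    blackout = PERIOD_US - BUDGET_US
    return full_periods * BUDGET_US + max(0, rest - blackout)


def beta_upper(delta_us):
    """Best case: no other application runs, the whole window is usable."""
    return delta_us


if __name__ == "__main__":
    for delta in (1_000, 4_000, 5_000, 10_000, 15_000):
        print(delta, beta_lower(delta), beta_upper(delta))
```

If the slot position may move from period to period, the guaranteed lower curve is more pessimistic than this sketch, so the fixed-slot assumption should be checked against the RTOS in use.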
9.2.4
Calculus The methodology to analyze important system properties is based on the calculus that has been introduced in Ref. [8]. It allows determining the throughput and delay as experienced by the various input flows. The inputs to the calculus are the arrival curve, the deadline, the worst-case and best-case execution times for each flow, and the service curves of the resources. Each task $t_i$ has an upper and lower execution time, $e^u_i$ and $e^l_i$, respectively. Estimates of the upper and lower execution times can be measured while the system is under maximum load or idle, respectively. Nevertheless, this approach is restricted to soft quality of service constraints only. Another possibility is to formally analyze the tasks, which yields bounds on the worst-case and best-case execution times. A task tree $T$ contains a set $P_T$ of task paths $p_j \in P_T$. Each task path $p_j$ consists of a set of tasks $t_i \in p_j$. The upper and lower execution times $e^u_{p_j}$ and $e^l_{p_j}$ of a task path $p_j$ are the sums of the execution times $e^u_i$ and $e^l_i$ of its tasks $t_i$, that is:
\[ e^u_{p_j} = \sum_{t_i \in p_j} e^u_i \qquad e^u_T = \max_{p_j \in P_T} e^u_{p_j} \]
\[ e^l_{p_j} = \sum_{t_i \in p_j} e^l_i \qquad e^l_T = \max_{p_j \in P_T} e^l_{p_j} \]
where $e^u_T$ and $e^l_T$ denote the maximum upper and lower execution time of a task tree $T$. A flow $F$ contains a set $M_F$ of microflows $m_k \in M_F$. The upper and lower execution times $e^u_{m_k}$ and $e^l_{m_k}$ of a microflow $m_k$ are equal to the upper and lower execution times $e^u_{p_j}$ and $e^l_{p_j}$ of its associated task path $p_j$, and $e^u_F$ and $e^l_F$ denote the upper and lower execution time of a flow $F$, that is:
\[ e^u_{m_k} = e^u_{p_j} \qquad e^l_{m_k} = e^l_{p_j} \]
\[ e^u_F = \max_{m_k \in M_F} e^u_{m_k} \qquad e^l_F = \min_{m_k \in M_F} e^l_{m_k} \]
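The execution-time definitions above amount to sums over task paths followed by a maximum or minimum over paths. The sketch below retraces them on a made-up task set; the task names, their (lower, upper) times, and the two paths are hypothetical and are not taken from the chapter's scenario.

```python
# Hypothetical (lower, upper) execution times per task, in microseconds.
TASK_TIMES_US = {"a": (2, 3), "b": (4, 6), "c": (1, 2), "d": (6, 9)}

# Hypothetical task paths of one task tree.
PATHS = {"p1": ["a", "b", "c"], "p2": ["a", "d"]}


def path_times(path):
    """(lower, upper) execution time of a task path: sums over its tasks."""
    lower = sum(TASK_TIMES_US[task][0] for task in path)
    upper = sum(TASK_TIMES_US[task][1] for task in path)
    return lower, upper


def tree_times(paths):
    """(lower, upper) for the task tree: maximum over all task paths."""
    times = [path_times(path) for path in paths.values()]
    return max(t[0] for t in times), max(t[1] for t in times)


def flow_times(microflow_paths):
    """(lower, upper) for a flow: minimum lower and maximum upper time
    over the task paths of its microflows."""
    times = [path_times(PATHS[name]) for name in microflow_paths]
    return min(t[0] for t in times), max(t[1] for t in times)


if __name__ == "__main__":
    print(path_times(PATHS["p1"]), path_times(PATHS["p2"]))
    print(tree_times(PATHS))
    print(flow_times(["p1", "p2"]))
```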
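Now, we can determine the upper and lower arrival curves $\dot{\alpha}^u$ and $\dot{\alpha}^l$ that describe the processed packet stream and the upper and lower service curves $\dot{\beta}^u$ and $\dot{\beta}^l$ that describe the remaining CPU capacity. The equations are derived from the real-time calculus that has been used in Ref. [8] to estimate the worst-case performance of complex network processor architectures. The equations are slightly different from those in Ref. [8] and yield closer bounds. First, we define the operators that will be used:
\[ \nu(\Delta) \wedge w(\Delta) = \min\{\nu(\Delta), w(\Delta)\} \]
\[ \nu(\Delta) \oplus w(\Delta) = \inf_{0 \le \lambda \le \Delta} \{\nu(\lambda) + w(\Delta - \lambda)\} \]
\[ \nu(\Delta) \otimes w(\Delta) = \sup_{0 \le \lambda} \{\nu(\Delta + \lambda) - w(\lambda)\} \]
\[ \nu(\Delta) \,\overline{\oplus}\, w(\Delta) = \sup_{0 \le \lambda \le \Delta} \{\nu(\lambda) + w(\Delta - \lambda)\} \]
\[ \nu(\Delta) \,\overline{\otimes}\, w(\Delta) = \inf_{0 \le \lambda} \{\nu(\Delta + \lambda) - w(\lambda)\} \]
Using these definitions, the required relations can be expressed as follows:
\[ \bar{\alpha}^u(\Delta) = e^u_F\, \alpha^u(\Delta) \qquad \bar{\alpha}^l(\Delta) = e^l_F\, \alpha^l(\Delta) \tag{9.1} \]
\[ \dot{\alpha}^u(\Delta) = \frac{1}{e^u_F}\, \dot{\bar{\alpha}}^u(\Delta) \qquad \dot{\alpha}^l(\Delta) = \frac{1}{e^l_F}\, \dot{\bar{\alpha}}^l(\Delta) \tag{9.2} \]
\[ \dot{\bar{\alpha}}^u(\Delta) = \left[ \left( \bar{\alpha}^u(\Delta) \oplus \beta^u(\Delta) \right) \otimes \beta^l(\Delta) \right] \wedge \beta^u(\Delta) \tag{9.3} \]
\[ \dot{\bar{\alpha}}^l(\Delta) = \left[ \left( \bar{\alpha}^l(\Delta) \otimes \beta^u(\Delta) \right) \oplus \beta^l(\Delta) \right] \wedge \beta^l(\Delta) \tag{9.4} \]
\[ \dot{\beta}^u(\Delta) = \left( \beta^u(\Delta) - \bar{\alpha}^l(\Delta) \right) \overline{\otimes}\, 0 \tag{9.5} \]
\[ \dot{\beta}^l(\Delta) = \left( \beta^l(\Delta) - \bar{\alpha}^u(\Delta) \right) \overline{\oplus}\, 0 \tag{9.6} \]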
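On curves sampled at integer window lengths, the two convolution-style operators reduce to short list comprehensions. The sketch below is an illustration under assumptions: curves are Python lists indexed by $\Delta = 0, 1, 2, \ldots$, Equation (9.6) is read as the running supremum of the lower service curve minus the scaled upper arrival curve, and the sample curves are invented. The two deconvolution operators are left out because their supremum and infimum run over an unbounded range, which a finite list cannot represent without extra assumptions.

```python
def min_plus_convolution(v, w):
    """inf over 0 <= lam <= d of v(lam) + w(d - lam), for each sampled d."""
    n = min(len(v), len(w))
    return [min(v[lam] + w[d - lam] for lam in range(d + 1)) for d in range(n)]


def max_plus_convolution(v, w):
    """sup over 0 <= lam <= d of v(lam) + w(d - lam), for each sampled d."""
    n = min(len(v), len(w))
    return [max(v[lam] + w[d - lam] for lam in range(d + 1)) for d in range(n)]


def remaining_lower_service(beta_l, alpha_u_scaled):
    """Assumed reading of Eq. (9.6): max-plus convolution of
    (beta_l - alpha_u_scaled) with the zero function."""
    difference = [b - a for b, a in zip(beta_l, alpha_u_scaled)]
    zero = [0] * len(difference)
    return max_plus_convolution(difference, zero)


# Invented sample curves, one entry per time unit.
BETA_L = [0, 0, 0, 1, 2, 3, 4, 5]          # CPU time guaranteed after a 2-unit gap
ALPHA_U_SCALED = [0, 2, 2, 2, 2, 2, 4, 4]  # demand: 2 units per packet, period 5

if __name__ == "__main__":
    print(min_plus_convolution([0, 1, 2, 3], [0, 2, 4, 6]))
    print(remaining_lower_service(BETA_L, ALPHA_U_SCALED))
```

The remaining service stays at zero until the guaranteed CPU time overtakes the demand, which is what a lower-priority flow would be left with in the fixed-priority scheme.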
Equations (9.1) and (9.2) are used to scale the arrival of packets with the execution times and back. The resulting worst-case delay of packets is bounded by the maximal horizontal distance between the scaled upper arrival curve $\bar{\alpha}^u$ and the lower service curve $\beta^l$, that is:
\[ \text{delay} \le \sup_{\Delta \ge 0} \left\{ \inf \left\{ \tau \ge 0 : \bar{\alpha}^u(\Delta) \le \beta^l(\Delta + \tau) \right\} \right\} \]
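The maximal horizontal distance between two sampled curves can be computed directly. The sketch below assumes curves given as lists indexed by integer window length, with the arrival curve already scaled to CPU time; the function name, the optional `longest_task` term for nonpreemptive tasks, and the sample curves are illustrative.

```python
def delay_bound(alpha_u_scaled, beta_l, longest_task=0):
    """Largest tau such that the service curve needs tau extra time units
    to catch up with the demand, plus an optional nonpreemption term."""
    worst = 0
    for delta, demand in enumerate(alpha_u_scaled):
        tau = 0
        while beta_l[delta + tau] < demand:
            tau += 1
            if delta + tau >= len(beta_l):
                raise ValueError("service curve horizon too short")
        worst = max(worst, tau)
    return worst + longest_task


# Invented sample curves, one entry per time unit.
ALPHA_U = [0, 2, 2, 2, 2, 2, 4, 4]
BETA_LOW = [0, 0, 0, 1, 2, 3, 4, 5]

if __name__ == "__main__":
    print(delay_bound(ALPHA_U, BETA_LOW))
    print(delay_bound(ALPHA_U, BETA_LOW, longest_task=1))
```

The `ValueError` makes it explicit when the sampled horizon is too short to decide the bound, rather than silently returning a value that is too small.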
The calculations have been derived for preemptive tasks. As the tasks in the application model of RNOS are nonpreemptive execution blocks, the actual worst-case delay is the calculated delay plus the execution time of the longest task, that is, $\text{delay} + \max_i(e_i)$. For best effort we use the worst-case and best-case execution times $e^u_T$ and $e^l_T$ instead of $e^u_F$ and $e^l_F$. Figure 9.7 represents graphically the calculation scheme that is applied. In step one, Equation (9.1) is applied; in step two, Equations (9.3) to (9.6); and in step three, Equation (9.2). To get the priority of the flow $F$, we divide the deadline $d_F$ by the upper execution time $e^u_F$ of the flow. A lower number denotes a higher priority. Now we can repeatedly apply the calculation scheme shown in Figure 9.7 to the fixed-priority scenario as shown in Figure 9.8.
Figure 9.7: Calculation scheme. Step 1 converts $\{\alpha^l, \alpha^u\}$ to $\{\bar{\alpha}^l, \bar{\alpha}^u\}$; step 2 computes the outgoing curves from $\{\bar{\alpha}^l, \bar{\alpha}^u\}$ and $\{\beta^l, \beta^u\}$; step 3 converts the result back.
Figure 9.8: Fixed-priority calculation scheme. The calculation is applied to Flow 1, Flow 2, Flow 3, and Flow BE in order from high to low priority, each stage passing the remaining $\{\beta^l, \beta^u\}$ on to the next.
Example 9.4. In our scenario we would like to know how many voice channels can be processed and what the impact is on the best-effort forwarding service. As packets can arrive at line rate and the system is not capable of processing all the packets at line rate, there must be a flow filter early in the task tree. Up to the flow filter, the system has to process all the packets that arrive at the system. Only after the flow filter does the system know which packets to prioritize. Figure 9.9 shows the added flow-filter task in the task tree for packet reception. Table 9.1 shows the upper execution times and the deadlines for the flows, including the special flow-filter flow, which includes the tasks eth-mac-rx and flow-filter. The flow-filter flow contains all packets. The arrival curve for the flow-filter flow is the maximum packet rate (line rate divided by the smallest packet size). The arrival curves of the other flows have been discussed previously.
Figure 9.9: Added flow-filter task in task tree for packet reception (eth-mac-rx, flow-filter, ip-header-check).

Table 9.1: Execution times and deadlines of flows.
Flow                             Upper execution time   Priority / internal deadline
Flow Filter                      22 µs                  22 µs
VoIP Data Receive                68 µs                  1000 µs
VoIP Data Transmit               48 µs                  2000 µs
VoIP Control Receive             678 µs                 100,000 µs
VoIP Control Transmit            778 µs                 100,000 µs
Best Effort Packet Forwarding    258 µs                 Infinite
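The priority rule stated in the text (deadline divided by upper execution time, lower number first) can be applied to the values of Table 9.1. The following sketch does only that; the dictionary layout and function name are illustrative.

```python
import math

# (upper execution time in µs, internal deadline in µs), from Table 9.1.
FLOWS = {
    "Flow Filter": (22, 22),
    "VoIP Data Receive": (68, 1000),
    "VoIP Data Transmit": (48, 2000),
    "VoIP Control Receive": (678, 100_000),
    "VoIP Control Transmit": (778, 100_000),
    "Best Effort Packet Forwarding": (258, math.inf),
}


def priority(upper_execution_us, deadline_us):
    """Deadline divided by upper execution time; lower means more urgent."""
    return deadline_us / upper_execution_us


# Flow names ordered from highest to lowest priority.
RANKED = sorted(FLOWS, key=lambda name: priority(*FLOWS[name]))

if __name__ == "__main__":
    for name in RANKED:
        print(f"{name}: {priority(*FLOWS[name]):.1f}")
```

With these numbers the flow filter ranks first and best-effort forwarding last, and the control transmit flow ends up slightly ahead of control receive because of its longer execution time.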
Figure 9.10 depicts the best-effort traffic forwarding rate depending on the number of active voice channels. The previously described calculus allows us to calculate the backlog and delay of each individual flow. For example, we can determine the influence of the worst-case arrival jitter of voice data traffic on the delay of the receive control traffic. Figure 9.11 depicts the results for a worst-case jitter of 0, 60, 120, and 180 ms. With the calculus and the model, it is easy to determine weak points in the system and where optimizations are worth considering. It is obvious that the implementation of the flow-filter should be very efficient; otherwise, most of the available resources will be spent filtering input traffic.
Figure 9.10: Best-effort traffic forwarding rate (pps) as a function of the number of active voice channels.
Figure 9.11: Delay (µs) of voice receive control traffic as a function of the number of voice channels, for different worst-case jitter of the voice data traffic.
9.3
IMPLEMENTATION MODEL OF RNOS The implementation part of the middleware platform provides the necessary infrastructure and application programming interface to implement applications that have been modeled using the analysis model. The approach is characterized by two properties:
1. The middleware is constructed in a way that matches the analysis model. Therefore, the performance properties of the resulting implementation are within the calculated bounds.
2. The programming interface reflects the abstraction provided by the application structure, that is, task trees and task paths, and the input model, that is, flows and microflows.
In addition to the application, input, and resource models described in Section 9.2, two further concepts are provided to build the implementation model. One is the path-thread, which links packets (which are part of a flow and belong to a microflow) with a task path. The task path is defined by the microflow the packet belongs to. The second element is the scheduler, which decides which path-thread and therefore which task is executed next.
9.3.1
Path-Threads For each packet that is to be processed, a path-thread is created. A path-thread has a task counter, a packet, an associated flow and microflow description, and therefore uniquely identifies a task path. The task counter points to the next task that is to be executed. Figure 9.12 gives the complete picture of the elements of our implementation model. Example 9.5. In our scenario we have a source path-thread that matches the incoming packet to a microflow. If the microflow is known, the source path-thread will create a new path-thread for the packet of that microflow. Otherwise, it will create a path-thread for the default nonreal-time flow. Figure 9.13 shows the source path-thread and three path-threads that might be created by the source path-thread.
9.3.2
Scheduler The scheduler decides which path-thread runs next. The selected path-thread will execute one task and surrender control back to the scheduler. We use an EDF scheduler to schedule the path-threads. As the tasks are nonpreemptive, the granularity of the tasks provides the preemption points for the EDF scheduler. Nonreal-time flows have an infinite deadline and therefore get executed only when there is no pending packet of a real-time flow.
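The interplay of path-threads and the EDF scheduler can be sketched in a few lines. This is not the RNOS API (RNOS itself is written in Embedded C++); the class names, fields, and the demo flows are hypothetical. Each path-thread carries a packet, a task path, a deadline, and a task counter, and the scheduler runs exactly one nonpreemptive task before choosing again, so a newly arrived real-time packet overtakes a best-effort packet at the next task boundary.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class PathThread:
    packet: str
    task_path: list        # callables, each one a nonpreemptive task
    deadline: float        # float("inf") for non-real-time flows
    task_counter: int = 0  # index of the next task to execute

    def run_next_task(self):
        """Run one task; return True if more tasks remain."""
        self.task_path[self.task_counter](self.packet)
        self.task_counter += 1
        return self.task_counter < len(self.task_path)


class EdfScheduler:
    def __init__(self):
        self._ready = []
        self._order = itertools.count()  # tie-breaker for equal deadlines

    def add(self, thread):
        heapq.heappush(self._ready, (thread.deadline, next(self._order), thread))

    def step(self):
        """Run one task of the thread with the earliest deadline."""
        if not self._ready:
            return False
        _, _, thread = heapq.heappop(self._ready)
        if thread.run_next_task():
            self.add(thread)
        return True

    def run(self):
        while self.step():
            pass


def demo():
    log = []

    def task(name):
        return lambda packet: log.append((name, packet))

    scheduler = EdfScheduler()
    scheduler.add(PathThread("best-effort",
                             [task("ip-header-check"), task("ip-forwarder")],
                             float("inf")))
    scheduler.step()  # the first best-effort task runs to completion
    scheduler.add(PathThread("voice",
                             [task("ip-header-check"), task("rtp")],
                             1000.0))
    scheduler.run()   # voice finishes before best-effort resumes
    return log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

Keeping the sequence counter in the heap entry avoids comparing `PathThread` objects directly and gives threads with equal deadlines a round-robin order.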
Figure 9.12: Object diagram of the implementation model, relating path-thread, packet, flow, microflow, task path, and task (a task path consists of one or more tasks).
9.3.3
Implementation The RNOS is implemented on top of an existing real-time operating system. It runs in one thread of the RTOS. The only precondition on the RTOS is that it must be capable of providing a predefined amount of processing power to the RNOS. In our implementation, the RNOS gets 6 ms of processing power in each interval of 10 ms; see Example 9.3. RNOS is implemented in Embedded C++. A complete set of APIs and base classes allows an efficient implementation of networking applications. Figure 9.14 depicts the elements of the RNOS and shows that it runs in a thread of the underlying RTOS. Programming with RNOS is different from programming with a standard RTOS. Applications are built from a set of existing and new tasks. Since the analysis cannot be completed without having the worst-case and best-case execution times of all the tasks, the tasks themselves have to be programmed first. The RNOS supports the measurement of worst-case and best-case execution times. The best-case execution time is measured while executing the task on an idle system while all the interrupts are disabled and the task has been preloaded into the cache. The worst-case execution time is more difficult to determine. It is measured with all caches and interrupts disabled. To this value, the worst-case execution times of the interrupt service routines are added. Another possibility to get the execution times is to formally analyze the compiled task code [20]. The more tasks that are available, the more RNOS becomes a construction kit to build networking applications. With RNOS, we have path-threads instead of standard threads. The lifetime of a path-thread is short compared to that of a standard thread. It exists only as long as its associated packet has not been processed completely. For each packet there is a path-thread. As we are concerned with packet processing only, the path-threads appear to be a very natural way of programming. Table 9.2 gives some analogies between the elements of a traditional RTOS and RNOS.
Figure 9.13: Source path-thread and three other path-threads. The source thread runs eth-mac-rx and the flow/microflow-filter. The forwarding-thread runs ip-header-check, nat-rx, classifier-in, acl-in, ip-forwarder, classifier-out, acl-out, nat-tx, ip-fragmentation, and eth-mac-tx. The VoIPRxCtrl-thread runs ip-header-check, nat-rx, classifier-in, acl-in, ip-forwarder, ip-reassembly, ip-rx, tcp, and socket. The VoIPRxData-thread runs ip-header-check, voip-data-mux, voip-ip-udp, rtp, and to-dsp.
Figure 9.14: RNOS runs in a thread of the underlying RTOS. RNOS (real-time network operating system) provides the domain-specific abstractions microflow, flow, source thread, learn thread, task, task tree, path-thread, and scheduler on top of a thread of the real-time operating system.

Table 9.2: Analogy of RTOS and RNOS.
Real-time Operating System    Real-time Network Operating System
Instruction                   Task
Program                       Task Path
Thread                        Path-Thread
Program Counter               Task Counter
Thread Priority               Deadline
Registers and Memory          Packet and Packet Annotations
The RNOS is not free of overhead. The creation and termination of path-threads and switching between path-threads reduce the CPU capacity available to the processing of tasks. The overhead of RNOS has been determined by comparing the throughput of the system as a function of the number of thread switches. Compared to a hand-coded system without path-threads, an overhead of 1 percent of the overall execution time for the creation and termination of path-threads and 3 percent for each switch between path-threads is introduced on our example system. If we allow fewer preemption points (by combining several tasks into a virtual task), the delay incurred by real-time flows increases and the overhead decreases. The granularity of tasks influences the tradeoff between packet delay and available throughput. The investigation of this relationship is the subject of further work.
9.4
MEASUREMENTS AND COMPARISON We have measured the best-effort forwarding rate under worst-case conditions. Figure 9.15 shows the measurement setup. The SmartBits is a professional network performance analysis system. It produces Ethernet traffic at line rate with 64-byte packets on line A. System 1 forwards this traffic to System 2, which forwards it back to the SmartBits (over line B). This way it is possible to measure throughput and delay. Additionally, we inject voice calls from System 1 to System 2. The scenario represents a worst-case system load for our example system. The voice channels do not get distorted (measured with Abacus, a professional telephony test system), that is, the voice data packets are processed with highest priority, while the traffic generated by the SmartBits represents the worst-case load for the example system. Figure 9.16 shows the measured and calculated best-effort traffic that is forwarded depending on the number of active voice channels. The measured numbers are 1–5 percent higher than the calculated numbers. For zero voice channels, the numbers lie within the inaccuracy of the measurement (1 percent). The larger the number of voice channels, the larger the difference becomes. The cache efficiency is higher with a larger number of voice channels because the voice packets arrive in bursts (the voice packets of all channels follow each other immediately), which yields a better cache hit ratio.
Figure 9.15: Measurement setup (SmartBits, System 1, and System 2, connected by lines A, B, and C).
Figure 9.16: Measured and calculated best-effort forwarding rate (pps) as a function of the number of voice channels.
9.5
CONCLUSIONS AND OUTLOOK The software platform RNOS allows a natural domain-specific modeling of applications for single CPU systems. Using the real-time calculus, the system parameters can be explored and weak points of the system can be exposed early in the design cycle. In a second step, the modeled application can be implemented seamlessly on top of RNOS. The results that are obtained with the real system are comparable with the results obtained in the analysis, which was one of the major goals of this work. There are several open questions that will be investigated further. One issue is related to the granularity of tasks and the corresponding impact on delay, throughput, and overhead. A next question is related to the assignment of deadlines. The current implementation adds a static deadline to each packet. At least for multimedia flows, we could add deadlines depending on the current state, for example, a buffer level. The provided calculus needs to be extended in order to cope with this extension. A final issue is the analysis of tradeoffs related to the batch processing in terms of a better cache usage, a smaller overhead, and an increased packet delay.
ACKNOWLEDGMENTS This work was funded in part by the Swiss Innovation Promotion Agency (KTI/CTI) through the project KTI 5500.2.
REFERENCES [1]
R. Keller, L. Ruf, A. Guindehi, and B. Plattner, “PromethOS: A dynamically extensible router architecture for active networks,” Proceedings of IWAN 2002, Zurich, Switzerland, pp. 20–31, December 2002.
[2]
Scott Karlin and Larry Peterson, “VERA: An extensible router architecture,” IEEE OPENARCH 01, Anchorage, Alaska, April 2001.
[3]
Lukas Kencl and J. Y. Le Boudec, “Adaptive load sharing for network processors,” Proceedings of Infocom, pp. 545–558, 2002.
[4]
E. Kohler et al., “The Click modular router,” ACM Transactions on Computer Systems, 18(3), pp. 263–297, August 2000.
[5]
David Mosberger and Larry L. Peterson, “Making paths explicit in the Scout operating system,” Proceedings of the Second Symposium on Operating Systems Design and Implementation, pp. 153–167, October 1996.
[6]
X. Qie, A. Bavier, L. Peterson, and S. Karlin, “Scheduling computations on a programmable router,” Proceedings of the ACM SIGMETRICS 2001 Conference, pp. 13–24, June 2001.
[7]
Prashanth Pappu and Tilman Wolf, “Scheduling processing resources in programmable routers,” Proceedings of the Twenty-First IEEE Conference on Computer Communications (INFOCOM), pp. 104–112, New York, June 2002.
[8]
L. Thiele, S. Chakraborty, M. Gries, and S. Künzli, “Design Space Exploration of Network Processor Architectures,” First Workshop on Network Processors (NP1) at the 8th International Symposium on High Performance Computer Architecture (HPCA8), Cambridge, Massachusetts, pp. 30–41, February 2002.
[9]
Niraj Shah, William Plishker, and Kurt Keutzer, “NP-Click: A Programming Model for the Intel IXP1200,” Second Workshop on Network Processors (NP2) at the 9th International Symposium on High Performance Computer Architecture (HPCA9), Anaheim, California, pp. 100–111, February 2003.
[10]
B. Goodman, “Internet telephony and modem delay,” IEEE Network, May 1999.
[11]
S. Shenker and J. Wroclawski, “General characterization parameters for integrated service network elements,” Request for Comment 2215, Internet Engineering Task Force, September 1997.
[12]
Matthias Gries, “Algorithm-Architecture Trade-offs in Network Processor Design,” Ph.D. dissertation, ETH Zurich, Switzerland, July 2001.
[13]
J. Janssen, Danny De Vleeschauwer, and Guido H. Petit, “Delay and distortion bounds for packetized voice calls of traditional PSTN quality,” Proceedings of the 1st IP-Telephony Workshop (IPTel 2000), pp. 105–110, April 2000.
[14]
International Telecommunication Union, “Packet based multimedia communication systems,” Recommendation H.323, Telecommunication Standardization Sector of ITU, Geneva, Switzerland, February 1998.
[15]
F. Andreasen and B. Foster, “Media Gateway Control Protocol (MGCP) Version 1.0,” Request for Comment (Informational) 3435, Internet Engineering Task Force, January 2003.
[16]
F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J. Segers, “Megaco Protocol Version 1.0,” Request for Comment (Proposed Standard) 3015, Internet Engineering Task Force, November 2000.
[17]
M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, “SIP: Session Initiation Protocol,” Request for Comment (Proposed Standard) 2543, Internet Engineering Task Force, March 1999.
[18]
M. Handley and V. Jacobson, “SDP: Session Description Protocol,” Request for Comment (Proposed Standard) 2327, Internet Engineering Task Force, April 1998.
[19]
International Telecommunication Union, “Pulse code modulation (PCM) of voice frequencies,” Recommendation G.711, Telecommunication Standardization Sector of ITU, Geneva, Switzerland, November 1988.
[20]
R. Heckmann, M. Langenbach, S. Thesing, and R. Wilhelm, “The influence of processor architecture on the design and the results of WCET tools,” Proceedings of the IEEE, 91(7), pp. 1038–1054, July 2003.
CHAPTER 10
On the Feasibility of Using Network Processors for DNA Queries
Herbert Bos, Department of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
Kaiming Huang, Leiden Institute of Advanced Computer Science, Universiteit Leiden, The Netherlands
The term network processor unit (NPU) is used to describe novel architectures designed explicitly for processing large volumes of data at high rates. Although the first products became available only in the late 1990s, a huge interest, both industrial and academic, has led to there now being over thirty different vendors worldwide, including major companies such as Intel, IBM, Motorola, and Cisco [1]. The prime application area of NPUs has been network systems, such as routers and network monitors that need to perform increasingly complex operations on data traffic at increasingly high link speeds. Indeed, to our knowledge, this has been the only application domain for NPUs to date. We observe, however, that bio-computing fields such as DNA processing share many of the properties and problems of high-speed networks. For example, in well-known algorithms like Blast and Smith-Waterman, a huge amount of data (e.g., the human genome) is scanned for particular DNA patterns [2]. This is similar to inspecting the content of each and every packet on a network link for the occurrence of the signature of an “Internet worm” (a typical, albeit rather demanding intrusion detection application [3]). We also observe that many bio-computing fields suffer from poorly performing software when run on common processors, which to a large extent can be attributed to the fact that the hardware is not geared for high throughput or the exploitation of parallelism. For example, in our lab the performance of the Blast algorithm was improved significantly when implemented on an FPGA compared to the identical algorithm on a state-of-the-art Pentium [4]. Similarly, using an array of 16 FPGAs in the Bioccelerator project, the Smith-Waterman algorithm improved by a factor of 100–1000 [5]. Unfortunately, FPGAs are hard to program and VHDL programmers are scarce. NPUs are an interesting compromise between FPGAs and general-purpose processors; they can be programmed in C and exploit parallelism to deal with high rates. Indeed, their design is optimized explicitly to deal with such data streams. For this reason, we evaluated the suitability of NPUs for implementing well-known DNA processing algorithms, in particular Blast. To achieve a realistic performance measure, the first (parallelizable) stage of the Blast algorithm has been implemented on an Intel IXP1200 network processor and used to process realistic queries on the DNA of a Zebrafish. The implementation will be referred to as IXPBlast throughout this chapter. The contribution of this chapter is two-fold. First, to our knowledge it is the first attempt to apply NPUs in an application domain that is radically different from network systems. Second, it explores in detail the design and implementation of one particular algorithm in bio-informatics on a particular platform (the IXP1200). The hardware that we have used in the evaluation is by no means state of the art. For example, the IXP1200 clocks in at a mere 232 MHz and provides no more than 24 microengine hardware contexts, while more modern NPUs of the same family operate at clock speeds that are almost an order of magnitude higher and offer up to 128 hardware contexts. Nevertheless, we feel confident that the results, bottlenecks identified, and lessons learned apply to many different types of NPU, including the state of the art. The ultimate goal of this work, of which the results presented in this chapter represent only a first step, is to empower molecular biologists, so that they will be able to execute complicated DNA analysis algorithms on relatively cheap NPU boards plugged into their desktop computers.
This contrasts sharply with current practices, where biologists often have to send their queries to a specialized analysis department which then runs the algorithm on a costly cluster computer (such as the BlastMachine2 [6]) and subsequently returns the results. The remainder of this chapter is organized as follows. The software and hardware architecture are discussed in Section 10.1. Section 10.2 discusses implementation details. The implementation is evaluated in Section 10.3, while related work is described in Section 10.4. In Section 10.5, we conclude and attempt to evaluate the suitability of (future generations of) NPUs for DNA processing.
10.1
ARCHITECTURE In this section, we discuss the problem area and the Blast algorithm, as well as the hardware and software configuration to implement the algorithm.
10.1.1
Scoring and Aligning IXPBlast is an implementation on Intel IXP1200 NPUs of the most time-consuming stage of the Blast algorithm for genome searches. Blast (Basic Local Alignment Search Tool) allows a molecular biologist to search nucleotide and protein databases with a specific query [2]. For illustration purposes, we assume in this chapter that all relevant information consists of DNA nucleotides (of which there exist exactly four, denoted by A, C, G, and T, respectively). Nevertheless, the same implementation can be used for other types of sequences as well (e.g., amino acids or proteins). Blast is designed to discover the similarity of a specific sequence to the query, by attempting to align the two. For example, if the query exactly matches a subset of the sequence in the database, this will be a perfect match, resulting in the highest possible score in the similarity search. However, even if two sequences differ at various places, they might still be “similar” and hence interesting to the molecular biologist. For instance, researchers may have located a gene (a sequence of DNA nucleotides) on the DNA of a Zebrafish that controls, say, the thickness of the Zebrafish’s spinal cord and wonder whether a similar gene exists in the human genome. Even though the match may not be perfect, a strong similarity might hint at similar functionality. The differences might be very small (e.g., the sequences are basically the same except for a few strands of “dead,” that is, nonfunctional, DNA), or more complicated (e.g., while most of the DNA is present, some subsequences of the nucleotides have moved). Such sequences should also score high on the similarity test. Denoting the large sequence in the database as $\mathrm{DNA}_{db}$ (e.g., the human or Zebrafish genome), and the (smaller) sequence that is searched for in $\mathrm{DNA}_{db}$ as the query (e.g., the DNA sequence of a specific gene), we can now distinguish two phases in the Blast algorithm. In Phase 1, the scoring is done. Given a query, this phase will keep track of the location where a (possibly small) subset of the query matches a subset of the $\mathrm{DNA}_{db}$ to which it is compared. In genome searches, this phase is the most compute intensive and exhibits a lot of potential parallelism. For example, given n computers, each processor could be used to score one nth of $\mathrm{DNA}_{db}$. In Phase 2 the scores are used to perform the actual alignment by searching for the area with maximum overall score (for instance, because a number of consecutive substrings have matched). This area represents the part of $\mathrm{DNA}_{db}$ that is most “similar” to the query and the overall score indicates how similar the two sequences are. By its nature, this phase is predominantly sequential. We therefore concentrate solely on Phase 1 of the Blast algorithm, and do not consider Phase 2 at all, other than to remark that it is handled using a conventional implementation on a Pentium processor.
Although Blast resembles certain algorithms in networking (e.g., programs that scan the contents of a packet for the signature of a worm), there are some differences. Where signature search algorithms often try to find an exact match for an entire string, Blast also performs partial matches. In other words, whenever a substring of a certain minimum size in the query matches a substring of the same size in DNA db , this is logged. In this chapter, this minimum size is referred to as the window size. For example, in Figure 10.1 the DNA db that is shown at the top is scanned for the occurrence of a pattern that resembles the query shown at the bottom. As partial matches also count, the query is partioned in overlapping windows that are all individually searched for in DNA db . Whenever a window matches, the position where the match occurs is recorded. Figure 10.1 illustrates the use of a window size of four. As a trivial example, suppose that the genome at the top of the figure is 100 nucleotides long. Parallelism can be exploited by searching for the query in the first segment of 50 nucleotides of DNA db on one processor and in the remaining segment of 50 nucleotides on a second processor. Note that with a window size of four, three windows that start in segment 1 will end in segment 2, so care must be taken to handle these cases correctly. In the figure, two matches are shown: window 0 (ACGA) matches at position 0, and
[Figure 10.1: Sequences, queries, and windows. Top: DNA db, the sequence of DNA material in the database (positions 0–9), divided into overlapping windows. Bottom: a query of 7 nucleotides, divided into four windows. The matching windows are marked.]
10.1
Architecture
window 3 (ACCA) matches at position 6. So the pairs (0,0) and (3,6) are recorded. Phase 2 looks at where each of the windows was matched (scored) and attempts to find an area with maximum score (e.g., where many consecutive windows have matched, but which may exhibit a few gaps or mismatches). Each window in the query should be compared to every window in DNA db. Denoting the window size as W, the number of nucleotides in DNA db as N, and the number of nucleotides in the query as M, the number of nucleotide comparisons that is needed to process DNA db in a naive implementation is given by: W · (N − W + 1) · (M − W + 1). A more efficient matching algorithm, in terms of number of comparisons, is known as Aho-Corasick [1]. In this algorithm, all the windows in the query are stored as a single trie structure. After setting up the trie, DNA db is matched to all the windows at once, one nucleotide at a time (see also Section 10.1.4). Aho-Corasick may be said to trade the number of comparisons for the number of states in the trie. For example, given a window size W, a (theoretical) upper bound on the number of states in the trie (independent of the query) would be $\sum_{i=0}^{W} 4^{i}$ (capable of representing all possible combinations of sets of W nucleotides). In practice, however, the number of states would be much less. In the query used in this chapter, known as MyD88 among molecular biologists, the number of nucleotides is 1611 and the number of states is a little over 10^4. The number of comparisons (and state transitions) in Aho-Corasick equals the number of nucleotides in DNA db. Both the naive strategy with direct comparisons (IXPBlastdirect) and Aho-Corasick (referred to simply as IXPBlast in this chapter) were implemented on the network processor. Results are discussed in Section 10.3. IXPBlastdirect was included as a baseline and to provide insight into the performance of the IXP1200 when performing a fixed and large set of operations. The main focus of our research, however, is on IXPBlast, due to its greater efficiency.
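The naive strategy can be sketched in a few lines of Python. This is only an illustration of the window-by-window comparison and of the cost formula W · (N − W + 1) · (M − W + 1), not the microengine code; the function names and the two short sequences are assumptions, chosen so that the recorded pairs agree with the (0,0) and (3,6) example in the text.

```python
def windows(seq, w):
    """Return all overlapping windows of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def naive_score(db, query, w):
    """Compare every query window with every database window.

    Returns the recorded (query_window_index, db_position) pairs and the
    number of nucleotide comparisons charged by the naive cost model.
    """
    hits = []
    comparisons = 0
    for q_idx, q_win in enumerate(windows(query, w)):
        for pos, d_win in enumerate(windows(db, w)):
            comparisons += w  # cost model: W comparisons per window pair
            if q_win == d_win:
                hits.append((q_idx, pos))
    return hits, comparisons


# Illustrative sequences (assumed), consistent with the matches in the text.
db = "ACGATTACCA"     # N = 10
query = "ACGACCA"     # M = 7
hits, comparisons = naive_score(db, query, 4)
print(hits)           # [(0, 0), (3, 6)]
print(comparisons)    # 4 * (10 - 4 + 1) * (7 - 4 + 1) = 112
```

Because the cost grows with both N and M, this approach becomes expensive for realistic query sizes, which is the motivation for the trie-based alternative.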
10.1.2 Hardware Configuration

The IXPBlast hardware configuration is shown in Figure 10.2. A general-purpose processor (Pentium-III) on a host is connected to one or more IXP1200 evaluation boards plugged into the PC’s PCI slots. Each of the boards contains a single IXP1200 NPU, onboard DRAM, and SRAM (256 MB and 8 MB, respectively), and two gigabit Ethernet ports. In our evaluation we used only a single (Radisys ENP-2506) NPU board. The NPU itself contains a two-level processing hierarchy, consisting of a StrongARM control processor and six independent RISC processors, known as microengines (MEs). Each ME supports four hardware contexts that have their
[Figure 10.2: The IXP hardware configuration.]
own program counters and register sets (allowing it to switch between contexts at zero cycle overhead). Each ME runs its own code from a small (1 K) instruction store. MEs control the transfer of network packets to SDRAM in the following way. Ethernet frames are received at the MACs and transferred over the proprietary IX bus to an NPU buffer in 64-byte chunks, known as mpackets. From these buffers (called RFIFOs by Intel, even though they are not real FIFOs at all) the mpackets can be transferred to SDRAM. MEs can subsequently read the packet data from SDRAM into their registers in order to process it. While this description is correct as a high-level overview of packet reception, the actual reception process is a bit more involved. In Section 10.1.3, a detailed explanation of packet reception in IXPBlast is given. On-chip there is also a small amount of RAM, known as scratch memory (4 KB on the IXP1200), that is somewhat faster than SRAM, but considerably slower than registers. Table 10.1 summarizes the memory hierarchy and the access time for the most important types of memory (timings obtained from Ref. [7]). Interestingly, the difference in access time between Scratch and SRAM (and between SRAM and SDRAM) is only a factor of 1.5–2. All buses are shared and, while the IX bus is fast enough to sustain multiple gigabit streams, memory bus bandwidth might become a bottleneck. Fortunately, the IXP provides mechanisms for latency hiding, for example, by allowing programs to access memory in a nonblocking, asynchronous way.
TABLE 10.1 IXPBlast memory hierarchy.

Memory                      Size                  Access time (cycles)
general purpose registers   128*4 bytes per ME    –
scratch memory              1 K words             12–14
SRAM                        8 MB                  16–20
SDRAM                       256 MB                33–40
In our configuration, practically all the code runs on the MEs. This means that the code needs to be compact, with an eye on the small instruction stores. The size of the instruction store also means that MEs cannot afford to run operating-system code. Instead, programmers run their code directly on the hardware. The StrongARM is used for loading the appropriate code in the ME instruction stores, for starting and stopping the code, and for reading the results from memory. In our configuration it runs a Linux 2.3.99 kernel, which we log on to from the Pentium across the PCI bus. In the first stage of the Blast algorithm, the Pentium is not used for any other purpose. The second stage, however, runs solely on the Pentium and does not concern the NPU at all.
10.1.3 Software Architecture

Given that the Blast program was to be implemented on an IXP1200, the first problem that needed to be solved was that of accessing the data, for example, determining where to store DNA db and where to store the query.
Accessing the DNAdb sequence

Starting with the former, many of the options are immediately ruled out, due to the size of the data sequences, which can be several gigabytes. In fact, the only viable storage locations for this data are the IXP’s SDRAM, and/or some form of external storage (e.g., the host’s memory, or an external database system). As the ENP-2506 has only 256 MB of SDRAM memory (of which some is used by Linux), it is not possible to load DNA db in its entirety in the IXP’s memory. On the other hand, since Blast requires access to the entire data, it needs to be as close as possible, for efficiency. This effectively rules out any solution that depends on DNA db remaining on external storage throughout the execution (e.g., continuously accessing data in host memory across the PCI bus). So, data has to be streamed into the IXP from an external source.
Again, there are only two ways of streaming DNA db into the NPU: (1) across the PCI bus, or (2) via the network ports. We have chosen the latter solution for a number of reasons. First, IXPBlast is intended as a network service for molecular biologists. For this reason, the ability to stream network packets to network cards from various places on the network is quite convenient. Second, in our case, the PCI bus that connects the IXP1200 to the Pentium is not very fast (64-bit/66 MHz) and hence might be stretched by the gigabit speeds offered by the 2 Gb network ports. Third, even if a faster bus had been available, it is commonly stated (true or not) that network speed will outgrow bus speed (for example, OC-768 operates at 40 Gb/s, which is beyond the capacity of next-generation PCI buses like PCI-X 533). As our aim was to evaluate NPUs with an eye on applying them in a variety of DNA processing fields, not just Blast, the solution that (in potential) provides the highest throughput seemed the best. Fourth, this decision seems in line with what NPUs were intended for: processing data packets coming in from the network. Future versions of NPUs might try to optimize the handling of network packets even more and we should therefore aim for a design that would benefit from such new features. The price that we pay for this, of course, is that we now have to dedicate valuable processing time on the IXP to packet reception, stripping off protocol headers, etc. Since the Blast algorithm performs a fair amount of computation per chunk of data (and on an IXP1200 is not able to cope with gigabit speeds anyway), it would have been significantly faster, in retrospect, to let the Pentium write the data in the IXP’s SDRAM directly. It should be mentioned that in IXPBlastdirect, the naive implementation, the encoding is deliberately suboptimal in size.
That is, although each nucleotide can be encoded with as few as two bits, we found that this may yield less than optimal performance, as a result of all the extra bit shifting/bit extraction that is required. This is explained in more detail in Section 10.1.5.
Accessing the query

While DNA db does not fit even in the IXP’s SDRAM, this is not the case for the query. Meaningful queries are generally much smaller than full DNA db. Ideally, the query would be fully contained in the NPU’s registers. However, this is possible only for very small queries (footnote 1). In practice, query sizes range from a few hundred to a few thousand nucleotides. This will easily fit in SRAM and possibly even in the 4 KB of scratch memory, depending on how nucleotides are encoded.

Footnote 1. Assuming we use all 128 general-purpose registers and a 2-bit encoding for the nucleotides, the theoretical upper bound would be a query of 128 × 32/2 = 2048 nucleotides, which is still realistic. However, this would leave no registers for the computation.
For now, we assume that in IXPBlastdirect the query is stored in scratch memory. We may reasonably expect scratch memory to be sufficiently large, especially considering that newer versions of the IXP have much larger scratch spaces (e.g., 16 KB on the IXP2800). Even so, if a query becomes even larger than that, it is trivial to use SRAM instead. As indicated by Table 10.1, the difference in access time between scratch memory and SRAM in the IXP1200 is less than 50 percent. In the Aho-Corasick implementation of IXPBlast we are not so fortunate. The number of nodes in the trie quickly expands to a size that exceeds the capacity of scratch memory. For this reason, the entire query in IXPBlast is always stored in SRAM.
Overview of packet processing

In Figure 10.3, we have sketched the life cycle of packets containing DNA material that are received by IXPBlast and the way that the implementation processes this data. The MEs are depicted in the center of the figure. Collectively, they
[Figure 10.3: Processing packets in IXPBlast. The figure shows the IXP1200 with the network feeding the receive RFIFO (1), a circular buffer of packets containing DNA nucleotides in SDRAM, the microengines ME0–ME5, the Aho-Corasick trie in SRAM (2a), the query in Scratch (2b), and the scoring step (3).]
are responsible both for receiving DNA db and for matching the DNA material in this sequence against the query. We will now discuss the numbered activities in some detail.

1. Packet reception. In this step, DNA db is sent to the NPU across a network in fixed-length IP packets (of approximately 500 bytes payload) and received (in chunks of 64-byte mpackets) first in the NPU’s RFIFO and subsequently in SDRAM. We have dedicated one ME (ME0) to the task of packet reception. The ME is responsible for (a) checking whether a packet has arrived from the network, (b) transferring the DNA material to a circular buffer in SDRAM, and (c) indicating to the remaining five MEs that there is DNA material to process.

2. Query in SRAM (or Scratch). The query is stored in faster memory than the packets. In the implementation of Aho-Corasick the query is stored in SRAM (2a), as it does not fit in the 4 KB scratch memory, except for very small queries. In the naive implementation (IXPBlastdirect), the entire query is stored in on-chip scratch memory for efficiency (2b). As Scratch memory is only 4 KB on IXP1200s, this limits the size of the query. In our implementation, as discussed in Section 10.2, we are able to store more than 10^4 nucleotides in scratch memory, far exceeding the length of most queries in practice.

3. Scoring. While ME0 is dedicated to packet reception, the threads on the remaining MEs implement the scoring algorithm described in Section 10.1.1, that is, every window in the query is compared to every window in DNA db (and the status is kept for all matches). Phase 2 of the Blast algorithm (the alignment, which takes place on a Pentium processor) starts only after the entire DNA db has been scored.

Note that when processing a packet, a thread really needs to access two packets. The reason is that the last few windows in a packet in DNA db straddle a packet boundary. For example, assume the window size is 12 nucleotides and a thread T processing packet p_i is about to process the last 11 nucleotides in the packet. In this case, the window spills into packet p_{i+1}. As a result, T must access up to 11 nucleotides of p_{i+1}. Conversely, all packets except the first need to be accessed by two threads: the one responsible for this packet and the one responsible for processing the previous packet. We will call these threads the packet’s main thread and spill thread, respectively.

Processing takes place in batches of B packets. For example, if B = 100,000, this means that 100,000 packets are received and processed in their entirety before the next batch is received (this is not to say, however, that all B packets first have to be received before the processing can start). For this purpose, every thread maintains a packet index in scratch memory. Each of these indices has to
point to the end of the batch buffer, before the next batch of B packets arrives. Otherwise, there is a risk that the entire batch cannot be stored. So if any of the indices lags behind, an error indication is generated and the system dies. Obviously, this is not very efficient. For example, if an index points to packet number B − 1, there would be B − 2 buffer slots available for receiving packets and there is no need to stop the application. In other words, the current implementation is over-conservative. We plan to fix this in a future implementation. It should be stressed that each of the MEs processes only 20 percent of all packets in the circular buffer. Finally, as shown in Figure 10.3, consecutive MEs do not need to process consecutive packets. For instance, ME3 in the figure has overtaken ME4 and ME5 .
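The main-thread and spill behavior described above can be illustrated with a small Python sketch. It is not the microengine code: the function name, the packet contents, and the set of query windows are hypothetical, and it assumes that all packets have the same length. The point is only that a thread scanning packet i must borrow up to W − 1 nucleotides from packet i + 1 so that no window is lost at the boundary.

```python
def scan_packet(packets, i, query_windows, w):
    """Scan packet i as its main thread would.

    Windows that start in packet i may need up to w - 1 nucleotides of
    packet i + 1 (the spill). Returns (window, global_position) pairs.
    Assumes every packet has the same length.
    """
    size = len(packets[0])
    spill = packets[i + 1][:w - 1] if i + 1 < len(packets) else ""
    data = packets[i] + spill
    hits = []
    for start in range(len(packets[i])):
        win = data[start:start + w]
        if len(win) == w and win in query_windows:
            hits.append((win, i * size + start))
    return hits


# Hypothetical data: a 10-nucleotide sequence split into two packets of 5.
packets = ["ACGAT", "TACCA"]
query_windows = {"ACGA", "ATTA", "ACCA"}
per_packet = [hit for i in range(len(packets))
              for hit in scan_packet(packets, i, query_windows, 4)]
print(per_packet)  # [('ACGA', 0), ('ATTA', 3), ('ACCA', 6)]
```

The window ATTA starts in the first packet and ends in the second, so it is found only because the scan of the first packet reads the spill.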
10.1.4 Aho-Corasick

The code running on microengines ME1–ME5 executes the Aho-Corasick string-matching algorithm. It is not our intention to explain the algorithm in detail and interested readers are referred to Ref. [8]. Nevertheless, for clarity’s sake, a brief description of how the algorithm works at runtime is repeated here. For simplicity, we assume that the window size is 3. Consider a query consisting of the following windows: {acg,cgc,gcc,ccg,cga}. This is a query of length 7 that consists of five windows. The behavior of the pattern-matching machine is determined by three functions: a goto function g, a failure function f, and an output function output. For the example query, these functions are shown in Figure 10.4. State 0 is the start state. Given a current state and an input symbol from DNA db (the next nucleotide to match), the goto function g makes a mapping to a next state, or to the message fail. The goto function is represented as a directed graph in the figure. For example, the edge labelled a from 0 to 1 means that g(0, a) = 1. The absence of an arrow indicates fail. Whenever the fail message is returned, the failure function is consulted, which maps a state to a new state. When a state is a so-called output state, arriving here means that one or more of the windows have matched. In IXPBlast, this translates to recording both the corresponding positions in DNA db and the window(s) that matched. This is formalized by the output function that lists one or more windows at each of the output states. The pattern-matching machine now operates as follows. Suppose the current state is s and the current nucleotide of DNA db to match is x:

1. If g(s, x) = s′, the machine makes a goto transition. It enters state s′ and continues with the next nucleotide in DNA db. In addition, if output(s′) ≠ empty, the windows in output(s′) are recorded, together with the position in DNA db.
[Figure 10.4: The Aho-Corasick pattern-matching machine.]

(a) Goto function (state 0 loops to itself on t)
0 -a-> 1 -c-> 2 -g-> 3
0 -c-> 4 -g-> 5 -c-> 6
0 -g-> 7 -c-> 8 -c-> 9
4 -c-> 10 -g-> 11
5 -a-> 12

(b) Failure function
i     1  2  3  4  5  6  7  8  9   10  11  12
f(i)  0  4  5  0  7  8  0  4  10  4   5   1

(c) Output function
i          3    6    9    11   12
output(i)  acg  cgc  gcc  ccg  cga
2. If g(s, x) = fail, the failure function is consulted and the machine makes a failure transition to state s′ = f(s). The operating cycle continues with s′ as the current state and x as the current nucleotide to match.

Suppose for instance that the following subset of DNA db is processed: tacgcga. Starting in state 0, the algorithm first processes t, which brings about no (real) state transition. Next, a is processed, which leads to a move to state 1 on a goto transition. Subsequent goto transitions for the following two nucleotides, c and g, respectively, move the algorithm to state 3. State 3 happens to be an output state, so the window acg has matched. The next nucleotide in the input is c. As there is no goto transition from state 3 for this nucleotide, the algorithm makes a failure transition to state 5, continuing with the same input nucleotide c. From state 5 it makes a goto transition to state 6, which again is an output state (the window cgc was matched). The next nucleotide in the input is g. Again, a failure transition is made, this time to state 8 and processing continues with nucleotide g. The next failure transition leads to state 4 and from there a goto
transition can be made bringing us to state 5. The nucleotide to match is a for which, again, a goto transition is possible. The new state is 12, which is an output state for the window cga. At this point, all three windows hidden in the input string have been matched. This explanation describes the essence of the algorithm. In their paper, Aho and Corasick describe how it can be implemented even more efficiently as a finite automaton that makes exactly one state transition per input nucleotide. There are a few observations to make. First, the operation of the algorithm at runtime is extremely simple. Indeed, the complexity of Aho-Corasick is not the operating cycle, but the generation of the appropriate trie. The construction of the trie consists of executing a sequence of three straightforward algorithms. The precise nature of these algorithms is beyond the scope of this chapter and interested readers should refer to the original paper by Aho and Corasick. Second, the algorithm matches each nucleotide of DNA db against all windows at once. This means that the algorithm scales well for increasing numbers of windows and window sizes. Indeed, the size of the query has hardly any impact on the processing time, in contrast to most current implementations of the Blast algorithm. Even without considering implementation on NPUs, this is an interesting property. Third, the algorithm requires a relatively large amount of memory. Recent work has shown how to make the Aho-Corasick algorithm more memory efficient [9], but in this chapter the original algorithm was used, as it is faster and a few MB of SRAM is more than enough to store even the largest BLAST queries.
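The operating cycle walked through above can be reproduced with a short Python sketch of the goto, failure, and output functions. This is a textbook-style construction written for illustration, not the IXPBlast microengine code; the function names are assumptions. Inserting the five example windows in the order given in the text happens to yield the same state numbers as the example.

```python
from collections import deque


def build_machine(windows):
    """Build goto, failure and output functions for a set of windows."""
    goto = [{}]        # goto[state][symbol] -> next state
    output = [set()]   # output[state] -> windows recognised in that state
    for win in windows:
        state = 0
        for sym in win:
            if sym not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        output[state].add(win)

    failure = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to state 0
    while queue:                      # breadth-first over the trie
        r = queue.popleft()
        for sym, s in goto[r].items():
            queue.append(s)
            f = failure[r]
            while f != 0 and sym not in goto[f]:
                f = failure[f]
            failure[s] = goto[f].get(sym, 0)
            output[s] |= output[failure[s]]
    return goto, failure, output


def search(db, goto, failure, output):
    """Yield (start_position, window) for every window occurrence in db."""
    state = 0
    for pos, sym in enumerate(db):
        while state != 0 and sym not in goto[state]:
            state = failure[state]            # failure transition
        state = goto[state].get(sym, 0)       # goto transition (state 0 loops)
        for win in sorted(output[state]):
            yield pos - len(win) + 1, win


goto, failure, output = build_machine(["acg", "cgc", "gcc", "ccg", "cga"])
print(failure[1:])   # [0, 4, 5, 0, 7, 8, 0, 4, 10, 4, 5, 1]
print(list(search("tacgcga", goto, failure, output)))
# [(1, 'acg'), (2, 'cgc'), (4, 'cga')]
```

The sketch reports the start position of each matched window; recording the end position instead would work equally well as long as Phase 2 uses the same convention.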
10.1.5 Nucleotide Encoding

In DNA there are only four nucleotides, A, C, G, and T, so two bits suffice to encode a single nucleotide. In fact, in the Aho-Corasick version of IXPBlast both the query and DNA db are encoded in this fashion. When encoding DNA db in the naive implementation, however, some encoding efficiency was sacrificed for more efficient computation (making the implementation slightly less naive). The advantage of an alternative encoding is that it enables us to reduce the number of shift and mask operations that are needed in the comparisons on the microengines. The way this is done is by performing these operations a priori at the sender’s side. For example, if in a hypothetical system the window size is two nucleotides, the minimum addressable word size of the memory is four bits, and DNA db is ACGT, several shift/mask operations may be necessary to get to the right bits if an optimal 2-bit encoding is used. Let’s assume that A = 00, C = 01, G = 10, and T = 11. In this case, DNA db is encoded as 0001 1011 and to get to window CG some shifting and masking would be hard to avoid. However, no shift/mask operations are needed if the following encoding
is used: 0001 0110 1011 11 . . . (to each original encoding of a nucleotide the original encoding of the next nucleotide is appended). Now each consecutive window is found simply by reading the next 4-bit word and comparing this word to the query window directly. Of course, a less efficient encoding means more overhead in transferring the data in and out of memory and microengines. In practice, we discovered that an encoding of DNA db of 4 bits gave the best performance.
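The two encodings of the hypothetical ACGT example can be compared in a few lines of Python. This is a sketch for illustration only; bit strings are used instead of real machine words, and the function names are assumptions.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def packed(seq):
    """Dense 2-bit encoding, grouped into 4-bit words for display."""
    bits = "".join(CODE[n] for n in seq)
    return [bits[i:i + 4] for i in range(0, len(bits), 4)]


def overlapped(seq):
    """Each nucleotide's code followed by the code of the next nucleotide.

    Every 4-bit word then holds one complete window of two nucleotides.
    """
    return [CODE[a] + CODE[b] for a, b in zip(seq, seq[1:])]


print(packed("ACGT"))      # ['0001', '1011']
print(overlapped("ACGT"))  # ['0001', '0110', '1011']
```

In the dense form, the window CG (0110) straddles two words, whereas in the overlapped form it is simply the second word. The price is that the overlapped form uses roughly twice as many bits.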
10.2 IMPLEMENTATION DETAILS

DNA db is sent to the IXP from a neighboring host across a gigabit link. The sender uses mmap to slide a memory-mapped window over the original DNA db file. The part of the file that is currently in the window is memory mapped to the sending application’s address space, allowing the sender to read and send the data efficiently. Packets are sent in fixed-size UDP/IP Ethernet frames of 544 bytes where the UDP header is immediately followed by a 6-byte pad. The pad ensures that the DNA payload starts at an 8-byte boundary (14 B Ethernet + 20 B IP + 8 B UDP + 6 B pad = 48), which is convenient as the minimum addressable unit of SDRAM is 8 bytes. Careful readers may have noticed that this UDP setup is problematic. As we have said that the current IXPBlast implementation on the IXP1200 cannot keep up with the line speed, without some feedback mechanism the sender has no way of knowing how fast it can send the data without swamping the receiver. In the current version, we have ignored this problem and manually tuned the sender in such a way that packets are never dropped by the receiver. In the implementation of IXPBlast, two threads on ME0 are dedicated to receiving packets from the network and storing them in SDRAM. A third thread synchronizes with the remaining MEs. Dedicating more than two threads to packet reception does not improve performance. The implementation is much simpler than in most networking applications: ME0 loads data in SDRAM in batches of 100,000 packets. As soon as packets are available, the processing threads start processing them. They synchronize with ME0 by reading a counter each time they have finished processing a packet. If no new packet is available, they wait. On ME1–ME5 this is essentially the only synchronization that is necessary for a block of 100,000 packets to be processed. In network systems, such as routers, such simplistic synchronization is generally not permitted.
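The alignment arithmetic for the frame layout described above can be checked with a small Python sketch. The header sizes and the frame size are taken from the text; the variable names are assumptions, and the last figure assumes that the 544 bytes consist only of these headers, the pad, and data.

```python
ETH, IP, UDP, PAD = 14, 20, 8, 6   # header and pad sizes in bytes (from the text)
FRAME = 544                        # total frame size in bytes (from the text)
SDRAM_UNIT = 8                     # minimum addressable unit of SDRAM in bytes

payload_offset = ETH + IP + UDP + PAD
print(payload_offset)                    # 48
print(payload_offset % SDRAM_UNIT == 0)  # True: payload starts on an 8-byte boundary

# Smallest pad that aligns the payload after the three protocol headers.
min_pad = -(ETH + IP + UDP) % SDRAM_UNIT
print(min_pad)                           # 6

# Bytes left after the pad, assuming nothing else is counted in the 544 bytes.
print(FRAME - payload_offset)            # 496
```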
On ME1 –ME5 two threads are used to process the data, resulting in ten packet-processing threads in total. Again, using more threads did not improve the performance and even resulted in performance degradation (possibly due to
the additional resource contention). Each thread i (0 ≤ i < 10) is responsible for its own subset of the packets as follows: thread i processes packets i, i + 10, i + 20, i + 30, .... The thread compares each of the windows in the query to each window in the packet for which it is the main thread and to each window that starts in this packet and spills over into the next packet. For IXPBlastdirect, this means a direct comparison with each of the windows in the query, while for IXPBlast it means parsing one nucleotide at a time and comparing to all the windows at once. IXPBlast works with a window size of 12 (or three amino acids), a size suggested in the literature [10]. Each window has a score list, kept in SRAM, that tracks the positions in DNA db that match the window. Whenever a window match is found, the location in DNA db is noted in the window’s score list. After DNA db has been processed in its entirety, the scores are the end result of Phase 1. They are used as input for the remaining sequence alignment algorithm (Phase 2) on the Pentium. In IXPBlast, all initialization of memory, receiving of packets, storing of packets in SDRAM, and synchronization with the remaining MEs takes up 235 lines of microengine C code. The packet processing functionality on ME1–ME5 was implemented in 292 lines of code.
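The division of packets over the ten processing threads can be written down as a minimal Python sketch. The helper names are hypothetical; the stride of ten and the roles of main and spill thread follow the description in the text.

```python
N_THREADS = 10  # two threads on each of ME1-ME5


def packets_for(thread, n_packets, n_threads=N_THREADS):
    """Packets for which the given thread is the main thread."""
    return range(thread, n_packets, n_threads)


def main_thread(packet, n_threads=N_THREADS):
    """Thread responsible for the windows that start in this packet."""
    return packet % n_threads


def spill_thread(packet, n_threads=N_THREADS):
    """Thread that reads the start of this packet as spill of the previous one."""
    if packet == 0:
        return None  # the first packet has no predecessor
    return (packet - 1) % n_threads


print(list(packets_for(3, 35)))           # [3, 13, 23, 33]
print(main_thread(23), spill_thread(23))  # 3 2
```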
10.3 RESULTS

To validate our approach, IXPBlast was applied to the current release of the Zebrafish shotgun sequence, a genome of a little over 1.4 × 10^9 nucleotides, using a query that consisted of the 1611-nucleotide cDNA sequence of the Zebrafish MyD88 gene [11]. The window size is 12 nucleotides, so the number of windows is 1600. The results are compared to an equivalent implementation of the algorithm on a Pentium. Throughout this chapter, the process of obtaining DNA db (either from disk or from the network) and storing it in memory is referred to as DNA db loading. As explained in Section 10.2, the loading process in the current implementation of IXPBlast is somewhat inefficient. The reason is that, due to the lack of a feedback mechanism, the sender is forced to send at an artificially low rate to prevent the IXP from being swamped. As a result, when transferring the Zebrafish genome (approximately 1.4 × 10^9 nucleotides), the sender spends a small amount of its time sleeping. For the implementation on the Pentium, the sequence was read from a local disk and stored in host memory. In the measurements reported in clock cycles, the overhead of loading is not included. Arguably, doing so introduces a bias in the results to our advantage. We think this is not very serious, for two reasons. First, we are trying to evaluate the use of network processors for
TABLE 10.2 Results for various implementations of Blast.

Exp  Implementation                                  DNA db size (nucleotides)  Query size (nucleotides)  Cycles           Time (s) (incl. loading)
1    IXPBlastdirect, IXP1200 (232 MHz)               8.96 × 10^6 (10^4 pkts)    792                       not measured     140
2    Pentium-4 (1.8 GHz)                             1.4 × 10^9                 1611                      174,741,844,482  132
3    IXPBlast, IXP1200 (232 MHz)                     1.4 × 10^9                 1611                      not measured     129
4    IXPBlast, IXP1200, 1 thread only (simulated)    1792 (1 real pkt)          1611                      301,028          not measured
5    IXPBlast, IXP1200, 1 thread only (simulated)    1792 (1 no-match pkt)      1611                      288,655          not measured
6    IXPBlast, IXP1200, 1 thread only (simulated)    1792 (1 max-match pkt)     1611                      418,047          not measured
7    IXPBlast, IXP1200, 1 thread only (simulated)    1792 (1 no-match pkt)      200                       288,373          not measured
8    IXPBlast, IXP1200, 1 thread only (simulated)    1792 (1 max-match pkt)     200                       303,821          not measured
9    IXP1200, 232 MHz with implementation of         1.4 × 10^9                 1611                                       90
     Aho-Corasick as deterministic finite
     automaton
DNA processing and for this the loading across the network is an orthogonal activity (as mentioned earlier, one could use the same loading method as used for the Pentium). Second, including the loading of DNA db in the current implementation in the comparison does not make much sense, since a significant number of cycles are idle cycles that are spent waiting for the sender to wake up. However, the total time in seconds is also reported and this does include the loading time. For the Pentium implementation this involves only reading from the hard drive. For the implementation on the IXP, this involves both reading from the hard drive and transmission across the network, so in this case the results would seem biased to the advantage of the Pentium. The total time on the Pentium is nevertheless longer than the one on the IXP, probably because in the former case reading DNA db and processing the data are done sequentially, while in the latter case these are fully parallelized activities. The results are summarized in Table 10.2. Experiments 4–8 in Table 10.2 were obtained from the cycle-accurate IXP simulator, provided by Intel. Experiments 5–8 will be explained later. Experiment 4 is listed to get a handle on how many cycles are typically spent per packet by a single thread. We can use this result to verify the total processing time. DNA db consists of a total of 810,754 packets of 1792 nucleotides each, so that the total time to process the entire DNA db by a single thread (not including loading) can be estimated as (301,028 × 810,754)/(232 × 10^6) = 1051 sec. For ten threads the time would be roughly 105 seconds. If we include the overhead incurred by the loading (including sleeps), a total result of 129 seconds as measured is very close to what we expected. The number of cycles per nucleotide is 301,028/1792 = 168 cycles.
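The estimate above can be reproduced with a few lines of Python. All input figures are taken from the text and Table 10.2; the variable names are assumptions, and whole seconds are obtained by truncation.

```python
CYCLES_PER_PACKET = 301_028      # Experiment 4: one thread, one real packet
PACKETS = 810_754                # packets in DNA db
NUCLEOTIDES_PER_PACKET = 1792
CLOCK_HZ = 232e6                 # IXP1200 clock
THREADS = 10                     # packet-processing threads

single_thread_s = CYCLES_PER_PACKET * PACKETS / CLOCK_HZ
print(int(single_thread_s))            # 1051 (whole seconds)
print(int(single_thread_s / THREADS))  # 105
print(round(CYCLES_PER_PACKET / NUCLEOTIDES_PER_PACKET))  # 168 cycles per nucleotide
```

The ten-thread figure assumes perfect scaling across threads, which is the same simplification the estimate in the text makes.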
As the current implementation has not focused much on optimization, we feel confident that this number can be brought down in future versions of the code. It should be mentioned that for the measurement, only a single packet was processed, as simulating the processing of the entire genome takes an exceedingly long time. Interestingly, the performance of the 1.8-GHz P4 is very close to that of the 232-MHz IXP1200. The number of clock cycles in Experiment 2 on a Pentium 4 does not include loading. Converting clock cycles into seconds, the processing time on a Pentium comes to 97 seconds (slightly better than the estimated actual processing time on the IXP). With the naive implementation, IXPBlastdirect (shown as Experiment 1 of Table 10.2), processing even a mere 10,000 packets (containing only 792 nucleotides each) takes as long as 140 seconds. With this implementation it would take approximately 359.5 minutes (almost six hours) to process the entire genome. Recently, we also implemented the more efficient implementation of Aho-Corasick, in which all failure transitions are eliminated and the entire algorithm is reduced to
a deterministic finite automaton. The total processing time on the IXP now comes to just 90 seconds. Observe that the processing time in IXPBlast is determined by the size of DNA db and to a much lesser extent by the size of the query. The explanation is that Aho-Corasick always leads to a state transition when a new nucleotide is processed (even though the old and new state may be the same). The only difference is in the size of the trie in memory and the number of matches of windows that are found for a query. A match results in slightly more work than no match, as the score-keeping must be done in SRAM. In the simulator, subsets of the original 1611-nucleotide MyD88 query were used to construct queries of 200, 400, ..., 1400 nucleotides. All queries were used to process two packets, each with a length of 1611 nucleotides: (1) where there were no matches, and (2) where the packet was exactly the same as the 1611-nucleotide query. Both experiments were run at the same time, with one thread on one ME processing case (1) and another thread on a different ME processing case (2). As a result there will be some interference between the two experiments (but very little, as each experiment runs on a separate ME). If anything, the results will be better when run in complete isolation, rather than worse. Case (1) incurs no updates of SRAM, and is considered the fastest possible case. Case (2) represents the maximum workload as it will match every window in the 1611-nucleotide query. It also exhibits the greatest difference between the queries, as the short sequences will only encounter a larger number of matches in the beginning of the packet, but much fewer toward the end (see also Table 10.3). The two extremes (200 and 1611 nucleotides) are shown as Experiments 5–8 in Table 10.2. As shown in Figure 10.5, there is hardly any difference in performance for the packet with no matches.
These results are for the implementation of the Aho-Corasick algorithm as a deterministic finite automaton. We speculate that the varying sizes of the tries would have made more of a difference in the presence of a cache, but in the IXP1200 no caching takes place, so all accesses to the trie are accesses to SRAM. For this reason, the size of the trie has little impact on total performance.

Recall that in these experiments an old version of the IXP was used, while newer versions offer clock speeds comparable to Pentiums (in addition to faster memory and more MEs). For example, an IXP2800 runs at 1.4 GHz and offers 16 MEs with 8 threads each. It is tempting to say that the “competition” between the IXP and the Pentium will be won by the IXP2800’s faster clock alone, but this conclusion may be premature. Operating at higher clock rates may increase stalls, for example, due to memory bus contention. Nevertheless, considering the state of the art in NPUs, it is safe to say that the speed of even the naive implementation may be improved significantly by better hardware and, given that the results of IXPBlast are competitive even with the current hardware, the results are promising.

TABLE 10.3  Related IXPBlast measurements.

Aspect measured                             Results
Pkt size (including headers)                544 bytes
Number of packets in DNA db                 810,754
Pkt reception (for 1 pkt of 544 bytes)      1677 cycles
Size of trie for 1611 nucleotide query      103,12 states (≈240 KB)
No. of matches for 400 nucleotide query     109,142
No. of matches for 1611 nucleotide query    524,856

[Figure 10.5: Cycle counts for different queries. Cycles for minimum and maximum match (for different queries); x-axis: query size (200 to 1611 nucleotides); y-axis: cycles (0 to 450,000); series: Min match, Max match.]
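The observation that the automaton form of Aho-Corasick makes exactly one state transition per nucleotide can be illustrated with a short sketch. This is not the IXPBlast microcode: it is a textbook construction written in Python, the names `build_dfa` and `scan` are hypothetical, and the four-letter alphabet and the example patterns are assumptions made purely for illustration.

```python
from collections import deque

ALPHABET = "ACGT"  # nucleotide alphabet assumed for this sketch


def build_dfa(patterns):
    """Build an Aho-Corasick automaton with failure transitions compiled away.

    Returns (goto, out): goto[state][symbol] is defined for every symbol, so
    scanning needs one table lookup per input symbol; out[state] lists the
    patterns that end when the automaton enters that state.
    """
    goto = [{}]   # trie edges, later completed into a full transition table
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    fail = [0] * len(goto)
    queue = deque()
    # Depth-1 states fail to the root; missing root edges loop back to the root.
    for ch in ALPHABET:
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    # Breadth-first pass: a failure target is always shallower than the state
    # that points to it, so its table row is complete before it is copied.
    while queue:
        state = queue.popleft()
        for ch in ALPHABET:
            if ch in goto[state]:
                child = goto[state][ch]
                fail[child] = goto[fail[state]][ch]
                out[child] = out[child] + out[fail[child]]
                queue.append(child)
            else:
                goto[state][ch] = goto[fail[state]][ch]
    return goto, out


def scan(text, goto, out):
    """Return (end_index, pattern) for every pattern occurrence in text."""
    hits = []
    state = 0
    for i, ch in enumerate(text):
        state = goto[state][ch]   # one transition per nucleotide, match or not
        for pat in out[state]:
            hits.append((i, pat))
    return hits


goto, out = build_dfa(["ACG", "CG", "GT"])
hits = scan("ACGT", goto, out)
```

In this form the per-symbol cost of `scan` does not depend on the number of patterns, only the table size does, which matches the chapter's remark that query size mainly affects the size of the trie in memory and the number of matches.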
10.4
RELATED WORK

To our knowledge, there is no related work in applying NPUs to DNA processing. In bio-informatics there exists a plethora of projects that aim to accelerate the Blast algorithm, using one of the following methods: (1) improving the code (e.g., Ref. [12]), (2) use of clusters for parallel processing (e.g., Paracel’s BlastMachine2 [6]), and (3) implementation on hardware such as FPGAs (e.g., Decypher [14, 5, 4]). While IXPBlast is related to the work on cluster computers, it has an advantage in that it is cheaper to purchase and maintain than a complete Linux cluster. Implementations of Blast on FPGAs are able to exploit even more parallelism than IXPBlast. On the other hand, they are harder to program (and programmers with the necessary VHDL skills are scarce). Because of this, FPGAs are not very suitable for rapid development of new types of DNA processing tools. Moreover, NPUs are expected to increase their clock rates at roughly the same pace as current microprocessors such as the Pentium, meaning that the same code experiences a performance boost for free. The clock speeds of FPGAs, on the other hand, are not expected to grow quite so fast.

Variations of the Aho-Corasick algorithm are frequently used in network intrusion-detection systems like the newer versions of Snort [3]. While we are not aware of any implementation of the algorithm on network processors, there is an implementation of Snort version 1.9.x on an IXP2400 with a TCAM daughter card from IDT that implements the Boyer-Moore algorithm [13]. Boyer-Moore is similar to Aho-Corasick, but is limited to searching for a single pattern at a time.
10.5
CONCLUSIONS

In this chapter, it was demonstrated that NPUs are a promising platform for implementing certain algorithms in bio-informatics that could benefit from the NPU’s parallelism. We have shown how the Blast algorithm was successfully implemented on a fairly slow IXP NPU. The performance was equivalent to an implementation on a Pentium running at a much higher clock rate. As we have made few attempts to optimize our solution, there can be no doubt that NPUs are an interesting platform in fields other than networking in general and in DNA processing in particular.

Although the Aho-Corasick algorithm itself is fairly efficient, many improvements of Blast have been proposed in the literature. Moreover, several simple changes to IXPBlast are expected to boost performance significantly. First, a trivial improvement for IXPBlast is to switch to amino acids (effectively looking at three nucleotides at a time). Second, the Aho-Corasick algorithm that was used in IXPBlast can itself be improved. Research projects exist that make the algorithm either faster or more memory efficient (e.g., Ref. [9]). So far, we have not exploited any of these optimizations. Third, while implementing the feedback mechanism between sender and IXP will not change the number of cycles spent on processing the data, it will decrease the duration of the process as experienced by the user. Fourth, multiple network processors can be used in parallel by plugging multiple ENP NPU boards into the PCI slots of a PC to speed up the process even more.

We conclude this chapter with a speculation. So far, we have described the application of an architecture from the world of networking to the domain of bio-informatics. It may well be that the reverse direction is also useful. In other words, it would be interesting to see whether a similarity-search algorithm could be fruitfully applied to scanning traffic, for example, for a family of worms exhibiting a degree of family resemblance. Instead of looking for exact patterns in a flow, this would look for patterns that are similar to the query. For this to work, more research is needed to investigate whether worms that exploit a certain security hole sufficiently resemble each other to allow Blast to separate them from normal traffic.
ACKNOWLEDGMENTS

Our gratitude goes to Intel for providing us with IXP12EB boards and to the University of Pennsylvania for the loan of an ENP2506. Pattern matching on IXPs started within the EU IST SCAMPI project. We are indebted to Bart Kienhuis for early discussions about this idea and to Fons Verbeek for providing the knowledge about bio-informatics and for supplying the DNA material. Shlomo Raikin has been helpful in comparing our approach to the ones that use FPGAs. A massive thank-you is owed to Hendrik-Jan Hoogeboom for pointing us to the Aho-Corasick algorithm, which resulted in a speed-up of several orders of magnitude.
REFERENCES

[1] Kurt Keutzer and Niraj Shah, “Network processors: Origin of species,” Proceedings of ISCIS XVII, The Seventeenth International Symposium on Computer and Information Sciences, October 2002, pp. 41–45.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, October 1990, vol. 215, pp. 403–410.
[3] M. Roesch, “Snort: Lightweight intrusion detection for networks,” Proceedings of the 1999 USENIX LISA Systems Administration Conference, 1999. Available from www.snort.org/.
[4] Shlomo Raikin, Bart Kienhuis, Ed Deprettere, and Herman Spaink, “Hardware implementation of the first stage of the blast algorithms,” Technical Rpt. TR-2004-08, Leiden Institute of Advanced Computer Science, Leiden University, 2004.
[5] Compugen Ltd., “Bioccelerator product information,” eta.embl-heidelberg.de:8000/Bic/docs/bicINFO.html, 1995.
[6] Paracel, “Blastmachine2 technical specifications,” www.paracel.com/products/blastmachine2.php, November 2002.
[7] Erik J. Johnson and Aaron R. Kunze, IXP1200 Programming, Intel Press, 2002.
[8] Alfred V. Aho and Margaret J. Corasick, “Efficient string matching: An aid to bibliographic search,” Communications of the ACM, June 1975, vol. 18, pp. 333–340.
[9] Nathan Tuck, Timothy Sherwood, Brad Calder, and George Varghese, “Deterministic memory-efficient string matching algorithms for intrusion detection,” Proceedings of the IEEE Infocom Conference [1], pp. 333–340, March 2004.
[10] J. W. Kent, “BLAT—The BLAST-like alignment tool,” Genome Research, 2002, vol. 12, pp. 656–664.
[11] K. A. Lord, D. A. Liebermann, and B. Hoffman-Liebermann, “Nucleotide sequence and expression of a cDNA encoding MyD88, a novel myeloid differentiation primary response gene induced by il6,” Oncogene, July 1990, vol. 5, pp. 1095–1097.
[12] TimeLogic White Paper, “Gene-BLAST—An Intron-Spanning Extension to the BLAST Algorithm,” www.timelogic.com/company_articles.html, 2001.
[13] Consystant DeCanio Engineering, “The Snort network intrusion detection system on the Intel ixp2400 network processor, white paper,” www.consystant.com/technology/, February 2003.
[14] M. Gollery, “Tera-BLAST—a Hardware Accelerated Implementation of the BLAST Algorithm, White Paper,” www.timelogic.com/company_articles.html, August 2000.
CHAPTER 11

Pipeline Task Scheduling on Network Processors

Mark A. Franklin, Seema Datar
Department of Computer Science and Engineering
Washington University in St. Louis
The continuing increase in the logic and memory capacities associated with single VLSI chips has led to the development of chip multiprocessors (CMPs) where multiple, relatively simple processors are placed on a single chip. Such CMPs permit the effective exploitation of application-level parallelism and thus significantly increase the computational power available to an application. These developments have been exploited by the networking industry, where growth in the capacity, speed, and functionality of networking infrastructure has required fast, flexible, and efficient computing architectures. This has led to the development of network processors (NPs) [1, 2]. Processors within these NPs are often arranged so that they can be used in a pipelined manner. In some cases [2, 3] the NP may contain one or more processor pipelines. A typical architecture that has been analyzed (in a simpler form) from the perspective of obtaining “optimal” power and cache-size designs [4–6] is shown in Figure 11.1. Packets of information arrive from the network and are classified and routed by a scheduler [7] to one of a number of processor pipeline clusters. The clusters are sized so that bandwidth to off-chip memory meets specified performance requirements, and each cluster contains a number of processor pipelines. Applications may be associated with one or more of the processor pipelines. The application itself may be pipelined and may utilize multiple processors within a pipeline. A packet, after being routed to a cluster and pipeline, invokes one or more of the applications associated with the pipeline (e.g., routing, compression, etc.) and, after traversing the pipeline, returns to the scheduler, where it is transferred to a switch fabric for transmission to the next node in the network.
[Figure 11.1: Generic network processor. Labels in the diagram: Cluster 1 ... Cluster m, each containing rows of processors (Proc) with an I/O channel and demux and a memory channel to off-chip memory; ASIC; packet demultiplexer & scheduler; transmission interface; from network; to switching fabric.]

Given a number of applications that have processor pipeline implementations specified as a series of sequential tasks, a number of issues arise relating to the assignment of application tasks to the pipeline stages/processors. Since the configuration of pipelines on NPs is often very flexible (with some constraints), choices must be made concerning the number of pipelines, the number of stages per pipeline, the allocation of application tasks to pipeline stages, the sharing of pipeline stages by multiple application tasks, and the partitioning of applications into tasks. Generally one would like to assign application tasks to pipeline stages so that system throughput is maximized. In order to explore the performance of different pipeline configurations, a fast assignment algorithm that yields near-optimum results (i.e., maximizes throughput) must be available. This chapter presents a fast heuristic, Greedypipe, that provides such a capability.

Note that other considerations, in addition to throughput, may also be important. For example, with applications that have common tasks, sharing of a pipeline stage potentially reduces the amount of on-chip memory required and may also improve cache performance. Another example concerns the length of the pipeline. Additional pipeline stages often improve throughput, but also result in more chip power dissipation. Greedypipe permits one to explore throughput performance effects of pipeline length and stage sharing between applications.

The remainder of this chapter presents an approach to solving such assignment problems. We first present a mathematical formulation of the problem (Section 11.1). A greedy algorithm, Greedypipe, that permits rapid and reasonably good assignment solutions is then described (Section 11.2). Section 11.3 explores the use of Greedypipe in optimizing an NP pipeline design and in aiding in the exploration of alternative application algorithm partitionings. In Section 11.4 Greedypipe is applied to several pipeline design problems that occur in the context of network processors. Section 11.5 presents a summary and conclusions.
11.1
THE PIPELINE TASK ASSIGNMENT PROBLEM

We now formulate the problem, present the notation to be used and the constraints on task assignment, and develop the system performance metrics. The section ends with a brief review of related work.
11.1.1
Notation and Assignment Constraints

Network processors typically have multiple input flows where, in this chapter, a flow is defined as a set of successive functions that must be performed on packets belonging to the flow.¹ We consider here application algorithms that may be pipelined and are implemented on a pipeline of identical processors. The general issue of how to develop a pipelined algorithm for a given application is not considered here except for the special cases of Longest Prefix Match (LPM), encryption (AES), and compression (LZW) (discussed in Section 11.4). Each processor in the pipeline operates on a packet (or packet header), does some partial processing associated with the application, and then passes the packet (generally modified) along with other information to the next processor in the pipeline. Typically, there are some parts of a flow’s processing requirements that are common with other flows and thus they may share the processing that is available on one or more pipeline stages. For example, it is necessary to route all packets and, for a wide class of flows, common routing algorithms may be used, thus permitting the sharing of a pipeline stage or sequence of stages.² After passing through the pipeline, the packet is sent into a switch and from there into the network. We assume that there is a steady infinite stream of arriving packets.

¹ Flows correspond to a sequence of functions rather than packets associated with source-destination pairs.

The incoming flows can be represented as the set $F = \{F_1, F_2, F_3, \dots, F_N\}$, and each incoming packet belongs to one of the $N$ flows. The processing associated with each flow can be partitioned into an ordered set of tasks, $T_{ij}$, corresponding to the application(s) requirements of the flow, where $i$ ($1 \le i \le M_j$) and $j$ ($1 \le j \le N$), respectively, designate the task and flow number. $M_j$ is the number of tasks associated with flow $j$. Corresponding to the tasks are times, $t_{ij}$, associated with executing the tasks on a processor stage. Thus, for flow $j$:

$$T_j = \{T_{1j}, T_{2j}, \dots, T_{M_j j}\} \quad \text{and} \quad t_j = \{t_{1j}, t_{2j}, \dots, t_{M_j j}\}$$
The pipeline, PIPE, consists of $R$ identical processor stages: $\mathrm{PIPE} = \{S_1, S_2, S_3, \dots, S_R\}$. The task assignment problem consists of mapping the full set of tasks onto pipeline stages in a manner that preserves task ordering within a flow and optimizes a given performance metric. The real-time assignment problem performs this mapping in real time based on running estimates of traffic characteristics. In this chapter, we assume that these characteristics are known a priori and the resulting task assignments are long lived (i.e., we consider the non-real-time assignment problem). The assignment of task $i$ from flow $j$ to processor stage $k$ can be expressed using the binary variable $X_{ijk}$, where $X_{ijk} = 1$ if the task is assigned to the processor, and $X_{ijk} = 0$ otherwise. Thus, the number of tasks on a processor $k$ is given by:

$$P_{\mathrm{num}.k} = \sum_{j=1}^{N} \sum_{i=1}^{M_j} X_{ijk} \tag{11.1}$$

² Note that there is generally an overhead associated with moving data between successive stages. This can be easily dealt with within the framework provided. However, for a constant overhead between stages, this typically affects the pipeline latency but not throughput and thus does not impact task scheduling.
Additionally, the following three constraints apply:

1. The assignment process must maintain sequential task ordering. Thus, for $l$, $1 \le l \le M_j$, and for all $i, j, k, r$: if $X_{ijk} = 1$ and $X_{(i+l)jr} = 1$, then $k \le r$.
2. A task may be assigned to only a single processor. Thus, for a given task $i$ from flow $j$, $\sum_{k=1}^{R} X_{ijk} = 1$.
3. Consider situations where the same task is associated with multiple flows. Designating tasks to be shared across flows implies that there will be a single instantiation of the shared task and it will be assigned to a single pipeline stage. Thus, for the case of two flows, $j$, $s$, and two tasks, $i$, $r$, that are the same and are to share the same stage (and code): if $T_{ij} = T_{rs}$ and $X_{ijk} = 1$ and $X_{rsm} = 1$, then $k = m$. This can be extended naturally to more than two flows. If it is not desired to have such sharing even though the tasks are the same, this can be dealt with by giving the tasks different names.
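The three constraints can be turned into a small legality check. The sketch below is illustrative only: the function name `is_legal`, the dictionary encoding of $X_{ijk}$ (a key `(i, j)` mapped to the stage `k` for which $X_{ijk} = 1$), and the example flows are assumptions, not part of the chapter.

```python
def is_legal(flows, stage_of, num_stages):
    """Check the three assignment constraints of Section 11.1.1.

    flows      -- list of flows, each an ordered list of task names
    stage_of   -- dict mapping (i, j) -> stage k, meaning X_ijk = 1 (1-based)
    num_stages -- R, the number of pipeline stages
    """
    shared_stage = {}  # task name -> stage on which that name was first placed
    for j, tasks in enumerate(flows, start=1):
        previous = 1
        for i, name in enumerate(tasks, start=1):
            k = stage_of.get((i, j))
            # Constraint 2: every task sits on exactly one existing stage.
            if k is None or not 1 <= k <= num_stages:
                return False
            # Constraint 1: stage numbers never decrease along a flow.
            if k < previous:
                return False
            previous = k
            # Constraint 3: identically named tasks share a single stage.
            if shared_stage.setdefault(name, k) != k:
                return False
    return True


# Hypothetical example: two flows that share task "B", two stages.
flows = [["A", "B", "C"], ["D", "B"]]
legal = {(1, 1): 1, (2, 1): 2, (3, 1): 2, (1, 2): 1, (2, 2): 2}
out_of_order = {**legal, (3, 1): 1}   # task C placed before task B
split_share = {**legal, (2, 2): 1}    # shared task B placed on two stages
```

Encoding the assignment as one stage per `(i, j)` key makes "at most one stage per task" hold by construction, so the check for constraint 2 only has to confirm that a stage is present and in range.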
11.1.2
Performance Metrics

While we have now defined the set of possible legal assignments, to determine an optimal assignment, it is necessary to specify a performance metric. The metric of interest in the network processor environment generally relates to maximizing pipeline throughput (i.e., the number of packets per second flowing through the pipeline). We consider the case where there are one or more flows that may share tasks and (for simplicity) a single pipeline. Assume that pipeline throughput is limited by the maximum stage execution time taken over all stages in the pipeline. The execution time for a stage is determined by the tasks assigned to each stage. Thus, the execution time for a single flow $j$ on stage $k$ is given by:

$$s_{jk} = \sum_{i=1}^{M_j} X_{ijk}\, t_{ij} \tag{11.2}$$

and the maximum stage execution time for flow $j$ across all the $R$ stages is:

$$P_j = \max_{k=1}^{R} s_{jk} = \max_{k=1}^{R} \sum_{i=1}^{M_j} X_{ijk}\, t_{ij} \tag{11.3}$$

The maximum stage execution time over all flows and stages is given by:

$$P = \max_{j=1}^{N} P_j = \max_{j=1}^{N} \max_{k=1}^{R} \sum_{i=1}^{M_j} X_{ijk}\, t_{ij} \tag{11.4}$$

To maximize packet throughput, the problem becomes one of finding a task assignment that minimizes $P$, since the packet throughput $\approx 1/P$.
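Equations (11.2) to (11.4) translate directly into a few lines of code. The following is a minimal sketch; the function name `max_stage_time`, the `(i, j) -> k` encoding of the assignment, and the sample task times are assumptions chosen for illustration.

```python
def max_stage_time(times, stage_of, num_stages):
    """Return P of Equation (11.4) for a given assignment.

    times      -- times[j-1][i-1] is t_ij, the time of task i of flow j
    stage_of   -- dict mapping (i, j) -> stage k, meaning X_ijk = 1 (1-based)
    num_stages -- R, the number of pipeline stages
    """
    P = 0.0
    for j, flow_times in enumerate(times, start=1):
        s = [0.0] * (num_stages + 1)        # s[k] is s_jk of Equation (11.2)
        for i, t in enumerate(flow_times, start=1):
            s[stage_of[(i, j)]] += t
        P_j = max(s[1:])                    # Equation (11.3)
        P = max(P, P_j)                     # Equation (11.4)
    return P


# Hypothetical example: flow 1 has three tasks, flow 2 has two, R = 2.
times = [[3, 1, 2], [2, 1]]
stage_of = {(1, 1): 1, (2, 1): 2, (3, 1): 2, (1, 2): 1, (2, 2): 2}
P = max_stage_time(times, stage_of, 2)
throughput = 1 / P   # packets per time unit, using throughput ~ 1/P
```

Here flow 1 loads both stages with 3 time units and flow 2 loads them with 2 and 1, so the bottleneck is set by flow 1.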
11.1.3
Related Work

The assignment problem is similar to a number of problems considered in the literature. At the most basic level, a straightforward method of finding the task assignment that minimizes $P$ is to perform a complete enumeration of all possible assignments, identify feasible assignments, and select the optimum one. However, this suffers from a combinatorial explosion of choices. Additionally, no efficient algorithm for finding optimal solutions is known, since the problem has been shown to be NP-hard [8, 9]. There is a long history associated with related problems in deterministic job-shop scheduling [9], and these problems have been investigated from a variety of perspectives including integer programming, use of heuristics, and other approaches. Similar problems have also been dealt with in the context of finding compilation techniques for general-purpose parallel languages on multiprocessors [10, 11]. The primary objective of the compilation techniques is to minimize the response time while simultaneously reducing overhead due to interprocess synchronization and communication over a general parallel processor. Real-time packet scheduling problems have also been considered in the context of network processors [6]. In this case, however, packets were assumed to be completely processed on a single processor. A primary concern in that work was to assign packets to processors in a manner that minimized the effect of cold-cache misses on performance.

Our work is aimed at maximizing the throughput of a pipelined multiprocessor system by effective assignment of flow tasks to pipeline stages. It differs from prior work in a number of ways. Primarily, the problem definition differs from those considered in the past in that we consider multiple flows and pipelines, sharing of tasks on pipeline stages, and a bandwidth performance metric associated with the requirements of the computer pipeline environment. Greedypipe, the heuristic developed, takes these factors into account.
11.2
THE GREEDYPIPE ALGORITHM

Given the basic problem formulation, this section presents the Greedypipe heuristic algorithm, a simple example illustrating task assignment, and a summary of the algorithm’s performance.
11.2.1
Basic Idea

Greedypipe is a heuristic based, in part, on a greedy algorithm and thus gives no guarantee of finding an optimal solution. However, it provides solutions quickly, and tests indicate that it finds near-optimal solutions most of the time. Ideally, one would like an assignment where, for each flow, the total execution times of flow tasks associated with each stage are equal. That is, let the total time for executing the tasks associated with flow $j$ be:

$$\mathrm{TotalTime}_j = \sum_{i=1}^{M_j} t_{ij} \tag{11.5}$$

With $R$ stages in the pipeline, as indicated, an optimal allocation of tasks to pipeline stages is one where the execution time for each stage is equal and, under these conditions, the throughput is maximized:

$$\mathrm{Packet.Throughput}_j = 1 / \mathrm{Ideal.Delay.per.Stage}_j \tag{11.6}$$

Thus, with $R$ stages, for flow $j$, the ideal delay per stage is:

$$\mathrm{Ideal.Delay.per.Stage}_j = \mathrm{TotalTime}_j / R \tag{11.7}$$

Step 1 of Greedypipe is therefore to calculate this ideal delay using Equation (11.7). Actual task times and assignments that satisfy the constraints noted in Section 11.1.1 will, however, generally result in unequal execution times at each stage. The best of the possible assignments will be the one(s) that come closest to that ideal. Consider the time for execution of all flow $j$ tasks on stage $k$ (the same quantity as $s_{jk}$ in Equation (11.2)), as given by:

$$t_{jk} = \sum_{i=1}^{M_j} X_{ijk}\, t_{ij} \tag{11.8}$$

Since throughput is calculated from the inverse of the maximum stage execution time, the optimum assignment for flow $j$ is one that minimizes the value of $\mathrm{Var}_j$ in the expression given by Equation (11.9):

$$\mathrm{Var}_j = \max_{k=1}^{R} \left( t_{jk} - \mathrm{Ideal.Delay.per.Stage}_j \right) \tag{11.9}$$

When multiple flows are present there are potentially shared tasks that complicate task assignment. However, various assignments will meet the previous constraints, and selecting the optimal one now requires Equation (11.9) to be expanded so that the throughput across all the flows is maximized. This can be achieved by selecting the task-to-stage assignment that minimizes the maximum $\mathrm{Var}_j$ across all flows:

$$\mathrm{Var} = \max_{j=1}^{N} \mathrm{Var}_j \tag{11.10}$$

This metric attempts to equalize the distribution of tasks to stages both on a per-flow basis and on an aggregate-flow basis. Note that, at a given stage $k$, there may be multiple allocations for which the minimized $\mathrm{Var}$ has the same value. A simple tie-breaking rule is used that selects the assignment that, over all flows, minimizes the sum of the differences between the ideal delays and the assigned delays.
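The quantities in Equations (11.5) to (11.10) can be evaluated for any complete assignment as follows. This is a sketch of the metric only, not of the Greedypipe search itself; the function names `var_per_flow` and `var`, the `(i, j) -> k` encoding of the assignment, and the sample numbers are assumptions.

```python
def var_per_flow(times, stage_of, num_stages):
    """Return [Var_1, ..., Var_N] following Equations (11.5) to (11.9).

    times      -- times[j-1][i-1] is t_ij
    stage_of   -- dict mapping (i, j) -> stage k, meaning X_ijk = 1 (1-based)
    num_stages -- R
    """
    result = []
    for j, flow_times in enumerate(times, start=1):
        ideal = sum(flow_times) / num_stages       # Equations (11.5) and (11.7)
        t_jk = [0.0] * (num_stages + 1)
        for i, t in enumerate(flow_times, start=1):
            t_jk[stage_of[(i, j)]] += t            # Equation (11.8)
        result.append(max(t_jk[1:]) - ideal)       # Equation (11.9)
    return result


def var(times, stage_of, num_stages):
    """Return Var of Equation (11.10): the worst Var_j over all flows."""
    return max(var_per_flow(times, stage_of, num_stages))


# Hypothetical example: two flows on a two-stage pipeline.
times = [[3, 1, 2], [2, 1]]
stage_of = {(1, 1): 1, (2, 1): 2, (3, 1): 2, (1, 2): 1, (2, 2): 2}
per_flow = var_per_flow(times, stage_of, 2)
worst = var(times, stage_of, 2)
```

For a complete assignment each entry is non-negative, because the largest stage time of a flow cannot be smaller than its mean stage time; a value of zero means that flow is perfectly balanced.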
11.2.2
Overall Algorithm

The overall heuristic begins by calculating the Ideal.Delay.per.Stage for each of the flows. Task-to-stage allocations start with the first processor stage. Two sets of tasks, satisfying the constraints, are selected from each of the flows for allocation to this processor. The first set is chosen so that the variation, $\mathrm{Var}_j$, given by Equation (11.9) is minimized and is also a positive number. The second set is chosen similarly; however, $\mathrm{Var}_j$ is required to be a negative number. Additional sets that satisfy the constraints may be chosen at the cost of increased complexity and execution time. At this point, there are two allocations associated with each flow for the first processor and there are thus $2^N$ possible combinations of flow allocations. Each of these combinations is examined and the “best” two are kept for use in performing task assignments for the next pipeline stage. The best two correspond to the two that, for this stage, minimize $\mathrm{Var}$ as expressed in Equations (11.9) and (11.10) with $R = 1$.

Assignments for the next pipeline stage are now considered. The process begins by first calculating new Ideal.Delay.per.Stage values based on unallocated tasks and the number of remaining pipeline stages. Next, each of the two best allocations from the prior stage is used as a starting point for determining the best task-to-stage assignments for the current stage. For each of these and for each flow, two “best” assignments (positive and negative) are selected. As before, all combinations of these flow assignments are then examined and the two that minimize $\mathrm{Var}$ [Equation (11.10)] with $R = 2$ are now kept as starting points for considering the next pipeline stage (Stage 3). This process continues until all stages in the pipeline are examined and a complete assignment has been done. The better of the two final-stage assignments is then kept.

Notice that the algorithm has an implicit ordering aspect to it such that tasks and stages are considered in their first-to-last order. While in general this does well, given that local conditions determine allocations at each stage, it will not always result in the optimal allocation. To improve the results one can apply the same heuristic starting with the last task and stage, proceeding in a last-to-first order. Thus, in Greedypipe the algorithm is applied in both directions, with the final assignment being the better of the two.

As an example, consider the case of a single three-stage pipeline that is to handle two flows having five and four tasks, respectively, with the flows sharing one task (Task 2). The flows are defined in Table 11.1. The results of executing Greedypipe are shown in Table 11.2.

TABLE 11.1  Two flows with ordered tasks.

Task number            1    2    3    4    5
Flow 1                 T1   T2   T3   T4   T5
Task execution times   4    2    3    1    3
Flow 2                 T6   T7   T8   T2
Task execution times   5    1    1    2

TABLE 11.2  Greedypipe task assignment, maximum stage time = 5.

Stage 1          Stage 2                  Stage 3
T1, T6: T = 5    T2, T3, T7, T8: T = 5    T4, T5: T = 4
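The example can be checked mechanically. The sketch below re-evaluates the assignment of Table 11.2 on the flows of Table 11.1; the helper names `stage_times` and `order_preserved` are hypothetical, and the stage time is read, following Equation (11.4), as the largest per-flow sum of task times on that stage.

```python
# Flows of Table 11.1 as ordered (task name, execution time) pairs.
FLOWS = [
    [("T1", 4), ("T2", 2), ("T3", 3), ("T4", 1), ("T5", 3)],   # Flow 1
    [("T6", 5), ("T7", 1), ("T8", 1), ("T2", 2)],              # Flow 2
]

# Assignment of Table 11.2: task name -> pipeline stage.
STAGE_OF = {"T1": 1, "T6": 1,
            "T2": 2, "T3": 2, "T7": 2, "T8": 2,
            "T4": 3, "T5": 3}


def stage_times(flows, stage_of, num_stages):
    """Per-stage time: the largest per-flow sum of task times on that stage."""
    result = []
    for k in range(1, num_stages + 1):
        per_flow = [sum(t for name, t in flow if stage_of[name] == k)
                    for flow in flows]
        result.append(max(per_flow))
    return result


def order_preserved(flows, stage_of):
    """True if stage numbers never decrease along any flow (constraint 1)."""
    return all(stage_of[a] <= stage_of[b]
               for flow in flows
               for (a, _), (b, _) in zip(flow, flow[1:]))


times_per_stage = stage_times(FLOWS, STAGE_OF, 3)
```

Because the shared task T2 appears under a single name in `STAGE_OF`, it is necessarily placed on one stage for both flows, which is how the sharing constraint is respected here.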
11.2.3
Greedypipe Performance

There are two elements associated with evaluating performance. The first concerns how closely Greedypipe results match the true optimal results. While no analytic bounds on the errors have been developed, extensive experimentation has been performed in which the results of Greedypipe were compared with the true optimum as obtained by running an exhaustive search algorithm. For a wide range of randomly generated conditions, Greedypipe results are within 15 percent of the optimal 98 percent of the time, and in no case was the result more than 25 percent from the optimal. The results, however, varied with the number of pipeline stages and the percentage of shared tasks associated with different flows. Shared tasks present interesting complications and generally result in somewhat higher errors when the percentage of sharing is greater than 25 percent.

The second aspect of performance concerns execution time. In the previous experiments with systems containing four or five stages, one to three flows, and 12–15 tasks per flow, exhaustive searches were about three orders of magnitude slower than Greedypipe searches, which took on the order of a second (using a 500-MHz Sparc processor).
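An exhaustive baseline of the kind used for comparison can be sketched in a few lines. This is not the authors' search program: `exhaustive_optimum` is a hypothetical name, and the enumeration strategy (every non-decreasing stage vector per flow, with shared task names forced onto one stage) is an assumption about how such a baseline might be written. It is run here on the flows of Table 11.1.

```python
from itertools import combinations_with_replacement, product

FLOWS = [
    [("T1", 4), ("T2", 2), ("T3", 3), ("T4", 1), ("T5", 3)],   # Flow 1
    [("T6", 5), ("T7", 1), ("T8", 1), ("T2", 2)],              # Flow 2
]


def exhaustive_optimum(flows, num_stages):
    """Return (P, stage vectors) minimising the maximum stage time.

    flows -- list of flows, each an ordered list of (name, time) pairs
    """
    stages = range(1, num_stages + 1)
    # Each tuple is a non-decreasing list of stage numbers, one per task,
    # so task ordering within a flow is preserved by construction.
    per_flow_choices = [
        list(combinations_with_replacement(stages, len(flow))) for flow in flows
    ]
    best = None
    for choice in product(*per_flow_choices):
        placed = {}        # task name -> stage, used to enforce task sharing
        worst = 0
        feasible = True
        for flow, vector in zip(flows, choice):
            load = [0] * (num_stages + 1)
            for (name, time), k in zip(flow, vector):
                if placed.setdefault(name, k) != k:
                    feasible = False   # a shared task landed on two stages
                load[k] += time
            worst = max(worst, max(load[1:]))
        if feasible and (best is None or worst < best[0]):
            best = (worst, choice)
    return best


optimum, vectors = exhaustive_optimum(FLOWS, 3)
```

For this small example the search visits 21 x 15 = 315 combinations; the count grows multiplicatively with the number of flows and combinatorially with tasks and stages, which is the explosion that motivates a heuristic.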
11.3
PIPELINE DESIGN WITH GREEDYPIPE

In systems, such as network processors, with multiple pipelines and flows, determining the best pipeline and algorithm partitioning and pipeline stage assignment is difficult. The designer typically has a number of tradeoffs to consider, including the following:

✦ Number of pipeline stages and number of pipelines. Given applications, and associated flows, that have been partitioned into a number of ordered tasks, a designer can select the number of pipeline stages to implement. Up to a point, more stages generally result in higher throughput; however, more stages also require more chip area and higher power consumption. Greedypipe can be used to determine just what throughput can be achieved with a varying number of stages and pipelines.

✦ Algorithm task sharing. When multiple flows and associated applications are present, there is often an opportunity for the sharing of applications or of individual tasks across flows. This may result in smaller overall code space being required which, in turn, may reduce the cost of on-chip memory, or increase performance due to reduced instruction cache miss rates. However, when tasks are shared, there is less flexibility in task-to-stage assignments and generally lower overall throughput results. Greedypipe permits fast determination of the performance effects related to task sharing.

✦ Algorithm partitioning. For many applications, alternative algorithm-to-task partitionings are possible. For a given pipeline, each partitioning, after assignment, generally leads to different throughput results. Greedypipe can be used to determine those tasks that are performance bottlenecks and what performance gains can accrue from task repartitioning. Up to a point, for a fixed pipeline design, repartitioning may result in higher throughput, although at the cost of algorithm and software redesign.

The sections that follow illustrate the use of Greedypipe in these sorts of design activities. In each subsection, figures illustrating the results of a number of experiments are provided. Each data point presented represents the results of averaging forty experiments. In each experiment Greedypipe was used to generate a task-to-pipeline assignment, and the task times were randomly selected from a uniform distribution ranging from zero to ten time units.
11.3.1
Number of Pipeline Stages

Greedypipe may be used to determine the effect of pipeline depth on throughput performance. This is illustrated in Figure 11.2, where the results for a system with six flows and 20 tasks per flow are shown. As expected, the throughput increases with the number of stages and, with this many flows and tasks, the increase is almost linear until one reaches about 16 stages. After that, it is more difficult to evenly distribute the tasks over the stages and the throughput asymptotically approaches its maximum of 0.1 [i.e., 1/(maximum task time)]. This maximum is a result of the fact that it is likely that there is at least one task that is generated that has the maximum time. Similar results are obtained when one plots the throughput as a function of the number of pipelines.

[Figure 11.2: Throughput vs. number of pipeline stages. # Flows = 6, # Tasks = 20, % Task sharing = 0. x-axis: number of stages (2 to 22); y-axis: throughput (0 to 0.12).]
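The ceiling of roughly 0.1 can be motivated with a short calculation. It is a sketch under assumptions the chapter does not state explicitly: that the $n = 6 \times 20 = 120$ task times are independent, continuous draws from the uniform distribution on $[0, 10]$. Since no assignment can make $P$ smaller than the longest single task, throughput is bounded by the reciprocal of that task's time, and the expected maximum of $n$ such draws is $10\,n/(n+1)$:

```latex
P \;\ge\; \max_{i,j} t_{ij},
\qquad
\mathbb{E}\Big[\max_{i,j} t_{ij}\Big] = 10 \cdot \frac{n}{n+1} = 10 \cdot \frac{120}{121} \approx 9.92,
\qquad
\frac{1}{9.92} \approx 0.101
```

With 120 tasks the longest one is therefore typically very close to 10 time units, which is consistent with the plateau near 0.1 described above.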
11.3.2
Sharing of Tasks Between Flows

Task sharing between flows leads to an interdependence between the flows. This may be advantageous and result in better memory utilization and lower instruction cache miss rates; however, it also restricts the number of assignment options and thus potentially reduces the maximum throughput. Experiments were conducted for the case of six flows, 20 tasks per flow, and a single eight-stage pipeline where the fraction of tasks for each flow that are shared with other flows was varied. Thus, a 50 percent level of sharing means that 50 percent of tasks for each of the flows are common with the other flows. As expected, the results (see Figure 11.3) indicate a significant decrease in throughput as more tasks are shared between flows. The decrease is over 35 percent when one moves from 0 percent sharing to 100 percent sharing. In a full design analysis, this would be balanced against the potential gains noted earlier.
Figure 11.3: Throughput vs. percent shared tasks (# Flows = 6, # Tasks = 20, Stages = 8; x-axis: percent task sharing; y-axis: throughput).

11.3.3 Task Partitioning

For many applications that can be implemented in a pipelined manner, there is a choice concerning the task partitioning of the application. Having more tasks generally gives greater flexibility in assigning the tasks to the hardware pipeline and makes it possible to use longer pipelines. This usually results in higher throughput. However, there are two potential drawbacks. First, it can be difficult to divide tasks beyond some basic application partitioning, and thus there may be a nontrivial personnel cost associated with this job. Second, greater task partitioning often results in larger intertask communication costs that may increase latency and reduce throughput. To judge whether increased task partitioning is worth considering, it is first necessary to determine the potential performance gains that might result from such an endeavor. Greedypipe permits one to consider the possible gains from additional task partitioning.

The effects of task partitioning on throughput are illustrated in Figure 11.4. For both curves presented, the experiments had two flows, 11 tasks per flow, a single five-stage pipeline, and no task sharing. For the lower curve, the longest task in each flow is successively divided into two equal tasks and then Greedypipe is used to find a new task assignment. The Partition Cycle Number corresponds to how many times this division has occurred (a new maximum task is determined and divided on each cycle). For the upper curve, the two longest tasks in each flow are divided in a similar manner and the throughput is obtained. Both cases are beneficial since both provide more opportunities for improved task assignments that aim at equalizing the delay associated with each stage (and thus maximizing throughput). This permits the designer to determine the potential benefit associated with spending more time on algorithm partitioning.

Figure 11.4: Throughput vs. task partitioning (# Flows = 2, Initial # Tasks = 11, # Stages = 5, % Task sharing = 0; curves: two largest tasks/flow partitioned, single largest task/flow partitioned; x-axis: partition cycle number; y-axis: throughput).
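The partition cycle described above can be sketched for a single flow as follows. This is only an illustration under stated assumptions: the task times are hypothetical, and instead of running Greedypipe the sketch evaluates a simple upper bound on throughput (no assignment can beat the longest single task or a perfectly even split of the total work). Intertask communication costs are ignored.

```python
def split_longest(tasks):
    """Return a new task list with the longest task replaced by two equal halves."""
    i = max(range(len(tasks)), key=tasks.__getitem__)
    half = tasks[i] / 2.0
    return tasks[:i] + [half, half] + tasks[i + 1:]

def throughput_upper_bound(tasks, stages):
    """Bound set by the longest task or by an ideal even split over the stages."""
    bottleneck = max(max(tasks), sum(tasks) / stages)
    return 1.0 / bottleneck

tasks = [9.0, 3.0, 2.0, 2.0, 4.0]   # hypothetical task times for one flow
stages = 5
history = []                        # bound recorded before each partition cycle
for cycle in range(3):
    history.append(throughput_upper_bound(tasks, stages))
    tasks = split_longest(tasks)
```

In this toy case the first split raises the bound sharply, while the second split leaves it unchanged because another task of the same length remains; this mirrors the stepwise gains plotted against the partition cycle number.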
11.4 A NETWORK PROCESSOR PROBLEM

The Intel IXP 2800 is an example of an NP that can potentially be configured as a set of processor pipelines where the total number of processor stages is sixteen. Consider now a workload with the following three flows:

1. Longest Prefix Match (LPM). A flow where the only function to be performed is that of routing the packet using the LPM algorithm.
2. LPM and Compression. A flow where the NP must perform LPM and also compression on the packet payload.
3. LPM and Encryption. A flow where the NP must perform LPM and also encryption on the packet payload.

Software implementations for LPM, Compression, and Encryption are available and may provide adequate performance at the edges of the network. However, as one moves towards the core of the network, real-time constraints generally require the use of fast special-purpose hardware. An alternative to providing such special hardware is to utilize a general processor pipeline for these applications. As discussed later, pipelined implementations of these functions are available.

Say that our objective is to maximize overall throughput given the number of available pipeline processor stages. Given general pipelined implementations of the functions and their associated task times, the design space for maximizing throughput includes the following choices (see footnote 3):

1. Number and length of pipelines. Both the number of pipelines and the length of each pipeline can be selected subject to the constraint that the total number of processors over all the pipelines must be ≤ the number of processors available.
2. Number of tasks per function. Given a general approach to implementing each of the functions as a software pipeline of ordered tasks, just how should a particular set of tasks be selected?
3. Assignment of function tasks to pipeline stages. Given the first two items, assign each of the tasks to the pipeline stages in a manner that maximizes the overall throughput of the NP.

With Greedypipe it is a relatively simple matter to explore key aspects of this design space. Given a pipelined function implementation (item 2) and a choice of number and length of pipelines (item 1), Greedypipe will choose a near optimal assignment of tasks to stages (item 3). One can then iterate over a set of allowable pipeline configurations (item 1) and obtain a near optimal overall design. We next review the flow functions and some pipelined implementation options.

Footnote 3: Two important issues are omitted in this discussion. First, we are not considering multithreading effects. However, if we assume a high enough level of multithreading so that all off-chip memory latencies are masked, then our rough throughput analysis is not significantly affected by this omission. Second, we are not considering other issues associated with the memory hierarchy. That is, we are assuming that contention for common resources is not appreciable. Extensions to this work will bring these effects into the model.
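The outer iteration over pipeline configurations (item 1) can be sketched as a simple enumeration. The function name `stage_splits` is hypothetical, and the sketch assumes that every pipeline gets at least one stage and that all available stages are used; a per-configuration evaluator such as Greedypipe would then be called on each tuple.

```python
from itertools import combinations

def stage_splits(total_stages, num_pipelines):
    """Yield every tuple of pipeline lengths (each >= 1) summing to total_stages."""
    # Stars and bars: pick the cut points between consecutive stages.
    for cuts in combinations(range(1, total_stages), num_pipelines - 1):
        bounds = (0,) + cuts + (total_stages,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

one = list(stage_splits(16, 1))     # a single 16-stage pipeline
two = list(stage_splits(16, 2))     # (1, 15), (2, 14), ..., (15, 1)
three = list(stage_splits(16, 3))   # every ordered three-way split of 16 stages
```

With sixteen stages the search space is small (15 two-pipeline and 105 three-pipeline configurations), so exhaustive iteration over item 1 is inexpensive as long as each evaluation is fast.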
11.4.1 Longest Prefix Matching (LPM)

Performing IP address lookup for packet routing is a key operation that must be performed by routers, and such lookups require that a Longest Prefix Matching (LPM) algorithm be executed [12]. Because of its central role [13], NPs often include facilities to perform fast IP prefix matching, often using a combination of software and special-purpose hardware. One may also implement LPM utilizing a processor pipeline. Such an approach potentially has high throughput and may also be modified to meet the requirements of evolving standards. This section considers a pipelined LPM algorithm based on the work of Moestedt and Sjodin [12]. In their paper, a dedicated pipeline of special-purpose hardware is presented. Our implementation uses a pipeline of general-purpose processors and is based on the development of a routing tree that contains three types of nodes:

1. Valid route nodes. Tree leaf nodes that correspond to legal or valid routes (or destinations). Associated with these nodes is the router output port information.
2. Invalid route nodes. Tree leaf nodes that correspond to invalid routes.
3. Part route nodes. Interior tree nodes that represent branching nodes in the tree and correspond to part of a prefix.

Consider an example (see Figure 11.5) containing three prefixes embedded within a three-level tree. The first, leaf nodes labelled (1), is of length 3 and corresponds to the prefix 001. The second, leaf nodes labelled (2), is of length 7 and corresponds to prefix 0010001. The third, leaf nodes labelled (3), is of length 3 and corresponds to prefix 110. Note that with this LPM algorithm, shorter prefixes (e.g., 001) that are themselves contained in longer prefixes (e.g., 0010001) may spawn additional levels and Valid Route Nodes (e.g., Level 3 nodes corresponding to the first prefix).

Figure 11.5: Example of longest prefix match tree (Levels 1 to 3; legend: invalid route; valid route representing prefix number N; part of route).

Constructing the tree itself requires that one first decide on the number of levels desired and the number of bits to be considered at each level. Thus, there are numerous trees and associated data structures that satisfy the routing table requirements and have differing memory requirements. Given a packet destination address and our routing tree structure, obtaining the route involves successively accessing the tree following the path associated with this address. Say we have the address [0010 0111 X] where X corresponds to the remaining 24 bits in a 32-bit address. Reaching leaf node (2) in Level 3 requires three tree lookups following the wide path in Figure 11.5.

An approach to pipelining such an algorithm is to associate each level lookup with a separate task and allocate these tasks to stages in a processor pipeline. In our example a three-processor pipeline would be used, one stage for each level. With sufficient data memory bandwidth, this would increase the throughput of a packet stream by a factor of three over a single-processor implementation. Alternatively, two of the levels could be combined and implemented on one pipeline stage, resulting in a two-stage pipeline. This might be desirable if many of the lookups terminated at a given stage (as is often the case), thus resulting in fewer lookups in later stages. Using the earlier terminology, this example contains a single flow consisting of three computationally identical tasks where differences in execution times result from the number of memory accesses associated with each of the processor stages.

A tool called SimplePipe (see footnote 4) was used to evaluate tree level/task to pipeline stage assignments for a problem having 117,212 prefixes obtained from a Sprint network router (AS1239) [14, 15]. From this set, a five-level tree was constructed with each successive level handling 14, 5, 3, 2, and 8 bits. The entire tree contains 504,857 nodes.
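The level-by-level lookup can be sketched with a small multi-bit tree that uses the three prefixes of the example and assumed strides of 3, 2, and 3 bits. This is a simplification, not the chapter's data structure: rather than pushing shorter prefixes down into extra leaf nodes, each entry remembers the best route seen so far. All names (`STRIDES`, `insert`, `lookup`) are hypothetical.

```python
STRIDES = (3, 2, 3)   # bits consumed at tree levels 1, 2, 3 (assumed for the example)

def insert(root, prefix, route):
    """Insert a bit-string prefix, expanding it to the next level boundary."""
    node, pos = root, 0
    for stride in STRIDES:
        remaining = len(prefix) - pos
        if remaining > stride:
            # Prefix continues past this level: descend through a part-route node.
            entry = node.setdefault(prefix[pos:pos + stride],
                                    {"route": None, "child": None})
            if entry["child"] is None:
                entry["child"] = {}
            node, pos = entry["child"], pos + stride
        else:
            # Prefix ends inside this level: fill every chunk it covers.
            free = stride - remaining
            for i in range(2 ** free):
                suffix = format(i, "b").zfill(free) if free else ""
                entry = node.setdefault(prefix[pos:] + suffix,
                                        {"route": None, "child": None})
                if entry["route"] is None or entry["route"][0] <= len(prefix):
                    entry["route"] = (len(prefix), route)   # longer prefix wins
            return

def lookup(root, address):
    """Return (route of the longest matching prefix, number of levels accessed)."""
    node, pos, best, levels = root, 0, None, 0
    for stride in STRIDES:
        levels += 1
        entry = node.get(address[pos:pos + stride])
        if entry is None:
            break
        if entry["route"] is not None:
            best = entry["route"][1]
        if entry["child"] is None:
            break
        node, pos = entry["child"], pos + stride
    return best, levels

root = {}
for prefix, route in (("001", 1), ("0010001", 2), ("110", 3)):
    insert(root, prefix, route)
```

In a pipelined version each iteration of the loop in `lookup` would run on its own stage, and the second element of the result shows how many stages a given address actually needs: an address under prefix 110 finishes at level 1, while one under 0010001 needs all three levels.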
Traffic was modelled as a set of 5000 successive routing requests where the distribution of request prefixes followed those empirically obtained in Refs. [16, 17]. With a five-level tree, a separate task (with its task time) may be associated with accessing each tree level. These times are given in Table 11.3. Initially, Greedypipe was used to obtain the best assignment of these five tasks to processor pipelines of different lengths (1 to 16 stages). This is shown in the top graph in Figure 11.6 where, at four stages, the maximum throughput is achieved (note that task 2 takes the longest time) and remains constant after that.

Footnote 4: SimplePipe is a pipeline simulation tool based on SimpleScalar [18, 19]. An in-order processor with a clock rate of 1 GHz was modeled. All processor stage caches were taken to be of equal size, with the instruction cache being sufficiently large to hold the entire stage program. A 4-way associative, 8-KB data cache was assumed (no L2 cache was present) with off-chip memory latencies set at 20 processor clock cycles. The off-chip memory was taken to be structured as a set of independent banks, one for each of the processor stages, where each bank holds data associated with the tasks assigned to it. A fixed stage-to-stage communications delay of 10 ns was also assumed. Data for encryption and compression was obtained directly from SimpleScalar with the same parameters indicated in the text.

Table 11.3: Task times (µsec) for LPM, ENCRyption, and COMPression.

Appl:# tasks   Task 1   Task 2   Task 3   Task 4   Task 5
LPM:5          .028     .040     .026     .020     .014
ENCR:11        17.4     ≈ 11.4 per task for up to 10 tasks
COMP:15        ≈ 9.44 per task for up to 14 tasks

Figure 11.6: Throughput vs. number of stages (a single separate pipeline for each application). Curves: LPM throughput, compression throughput, encryption throughput; x-axis: # stages in pipeline (1 to 16); y-axis: pipeline throughput (packets/sec).
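The saturation at four stages can be reproduced from the LPM task times in Table 11.3 with a small exact search. Note that this is a swapped-in technique, not Greedypipe: a dynamic program over contiguous groupings of a single flow's ordered tasks, assuming that tasks must stay in order and ignoring the stage-to-stage communication delay. The names are hypothetical.

```python
from functools import lru_cache

def min_bottleneck(times, stages):
    """Smallest possible slowest-stage time when the ordered tasks are cut
    into at most `stages` contiguous groups."""
    n = len(times)
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Best bottleneck for tasks i..n-1 using at most k stages.
        if i == n:
            return 0.0
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], best(j, k - 1))
                   for j in range(i + 1, n + 1))

    return best(0, stages)

LPM_TIMES_US = (0.028, 0.040, 0.026, 0.020, 0.014)   # Table 11.3, microseconds

# Packets per second for pipeline lengths 1..16 (1e6 converts from 1/microsecond).
throughput = {s: 1e6 / min_bottleneck(LPM_TIMES_US, s) for s in range(1, 17)}
```

With four stages the last two tasks (0.020 + 0.014 = 0.034 µs) share a stage without exceeding task 2's 0.040 µs, so the bottleneck is already the longest single task and further stages cannot help.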
11.4.2 AES Encryption—A Pipelined Implementation

Encryption involves transforming unsecured information (plaintext) into coded information (ciphertext), with the process being controlled by an algorithm and a key. The Advanced Encryption Standard (AES) considered here is an iterated block cipher that supports independently specified variable block and key lengths (128, 192, or 256 bits). The algorithm has a variable number of iterations dependent on the key length. For 128-bit blocks, considered here, ten iterations are required [20].

Algorithm transformations operate on an intermediate result called the State, which is defined as a rectangular array with four rows and Nb columns (Nb = block length/32). Transformations treat the 128-bit data block as a four-column rectangular array of four-byte vectors. The data block along with the 128-bit (16 B) plaintext is interpreted as a State. The cipher key is also considered to be a rectangular array with four rows, the number of columns Nk being the key length divided by 32. The number of rounds (iterations) depends on the values Nb and Nk and, for 128-bit blocks with the 128-bit key shown in Figure 11.7, is ten. The algorithm consists of an initial data/key addition, then nine round transformations, followed by a final round. The key schedule expands the cipher key so that a different round key is created for each round transformation. Each round transformation comprises the following four operations: a ByteSub Transformation, a ShiftRow Transformation, a MixColumn Transformation, and a Round Key Addition. The final round is similar to earlier rounds except that the MixColumn(State) operation is omitted. Details associated with these operations can be found in Ref. [21].

An overview of a single processor/stage implementation is shown on the left side of Figure 11.7, while a pipelined implementation employing loop unrolling [22] is shown on the right side of the figure. As indicated, each iteration can be associated with a pipelined task. At the end of each iteration, both the schedule and the transformed packet are passed from one stage to the next. The task times associated with each of the eleven tasks are shown in Table 11.3. The values were obtained from a SimpleScalar [18, 19] simulation of the algorithm.

Figure 11.7: AES algorithm implementation block diagram. Left: iterative looping (key schedule with 128-bit key and 128-bit data block initialization, R = 0, then the round logic block repeated while R < 10, producing the cipher). Right: pipelined implementation (Task 1: key scheduling and data block initialization; Tasks 2 to 11: round logic blocks for Rounds 1 to 10, with the schedule and packet passed between tasks).

Figure 11.6 shows the throughput obtained when Greedypipe is used to assign only these eleven tasks to pipelines of different lengths. Notice that since there are eleven tasks, there is no throughput improvement once the pipeline length reaches eleven. Furthermore, note that given the much higher computational complexity of encryption vs. LPM, the throughput that can be achieved is between two and three orders of magnitude less for pipelines of equivalent length.
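As a rough cross-check of the "two to three orders of magnitude" remark, the saturated throughputs can be estimated from the Table 11.3 task times alone. The estimate assumes pipelines of at least eleven stages, so that the slowest single task sets the rate for each application, and it ignores the stage-to-stage communication delay; the symbols are introduced here only for illustration.

```latex
\begin{align*}
T_{\mathrm{LPM}}  &\approx \frac{1}{0.040\ \mu\mathrm{s}} = 2.5 \times 10^{7}\ \text{packets/s},\\
T_{\mathrm{ENCR}} &\approx \frac{1}{17.4\ \mu\mathrm{s}} \approx 5.7 \times 10^{4}\ \text{packets/s},\\
\frac{T_{\mathrm{LPM}}}{T_{\mathrm{ENCR}}} &= \frac{17.4}{0.040} = 435 \approx 10^{2.6}.
\end{align*}
```

Under these assumptions the ratio falls between two and three orders of magnitude, in line with the statement in the text.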
11.4.3 Data Compression—A Pipelined Implementation

With data compression (DC), a string of characters is transformed into a new reduced-length string having the same information. In this way the bandwidth required for passing a given string is reduced. LZW is a DC method that takes advantage of recurring patterns of strings that occur in data files. The underlying dictionary method was created by Ziv and Lempel [23] and was further refined by Welch [24]. LZW is a “dictionary”-based compression algorithm that encodes input data by referencing a dictionary. Encoding a substring requires that only a single code number corresponding to that substring’s dictionary index be written to the output file. LZW starts out with a dictionary of 256 characters (for the 8-bit case) and uses that as the “standard” character set. Data is then read one byte at a time (e.g., “p,” “q,” etc.) and encoded as the index number taken from the dictionary. When a new longer substring is encountered (say, “pq” and later say “pqr”), the dictionary is expanded to include the new substring and an index is associated with the new substring. The new index is then used when the new substring is encountered. To quickly associate substrings with indices, hashing techniques may be employed, and to limit memory size, a limit can be placed on the maximum number of dictionary entries (say, 1024).

With a straightforward pipelined LZW implementation, the input data is partitioned and successive pipeline stages operate on successive portions of the input data. Thus, if there are N stages in the pipeline, and a block of B bytes requires compression, each stage operates on and compresses B/N successive input data bytes. Figure 11.8 shows the pipelined implementation. The partially compressed data block and the updated dictionary are made available to every stage of the pipeline and the resulting system throughput increases almost linearly with N. Task times (found using SimpleScalar) associated with each task for a 14-task implementation are shown in Table 11.3, and Figure 11.6 shows the throughput obtained when Greedypipe is used to assign these tasks to pipelines of different lengths.

Figure 11.8: Pipelining of the LZW algorithm (an uncompressed block of B bytes passes through Task 1 to Task N; each task applies the LZW logic block to the next B/N bytes and hands the dictionary and data block to the following task, yielding the compressed block).
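The dictionary mechanism described above can be sketched in a few lines. This is a textbook LZW encoder and decoder over bytes with a cap on dictionary size, not the SimpleScalar implementation measured in the chapter; the function names and the `max_entries` parameter are hypothetical, and hashing is left to Python's built-in dictionaries.

```python
def lzw_compress(data, max_entries=1024):
    """Encode bytes as a list of dictionary indices using basic LZW."""
    dictionary = {bytes([i]): i for i in range(256)}   # the 8-bit character set
    codes, current = [], b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate                 # keep extending the match
        else:
            codes.append(dictionary[current])   # emit index of the longest match
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)   # learn the new substring
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes, max_entries=1024):
    """Rebuild the bytes, reconstructing the same dictionary on the fly."""
    table = {i: bytes([i]) for i in range(256)}
    if not codes:
        return b""
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # A code not yet in the table can only be previous + its own first byte.
        entry = table[code] if code in table else previous + previous[:1]
        out.append(entry)
        if len(table) < max_entries:
            table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

codes = lzw_compress(b"pqpqpq")   # "pq" is learned as index 256 and then reused
```

In the pipelined arrangement of Figure 11.8, the loop body of `lzw_compress` would run on each stage over its own B/N-byte portion, with `dictionary`, `current`, and the partial output handed from stage to stage.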
11.4.4 Greedypipe NP Example Design Results

The results shown in Figure 11.6 are for single flows executing on a single pipeline. The problem becomes more complex when:

1. Multiple flows are considered with some flows requiring multiple applications.
2. Some of the flow applications/tasks are shared and are to be instantiated only once.
3. Multiple processor pipelines are present with a constraint on the total number of stages available.

This more complex problem is now considered. Table 11.3 shows the execution times per task for the three applications where the output from one task becomes the input to the next task. Assume that we have the three flows indicated earlier. We assume that packets associated with flow 1 (LPM) are 40 bytes long while packets from flows 2 and 3 (LPM and Encryption; LPM and Compression) are 1500 bytes long. Three experiments were performed:

1. Single pipeline. A single 16-stage pipeline was considered where the LPM application was shared among the three flows.
2. Two pipelines. Two pipelines were used where the length of each pipeline was varied between 1 and 15 stages, with the sum of the stages being set to 16. Flow 1 was assigned to one pipeline and its LPM application was not shared with the other flows. Flows 2 and 3 were assigned to the second pipeline and the LPM application was shared between them.
3. Three pipelines. Three pipelines were used, one flow assigned to each, where the length of each pipeline was varied between 1 and 14 stages with the sum of the stages set to 16.

Consider the first case where there is a single 16-stage pipeline. Entering the set of flows, their respective applications, and the application task sequence and times, Greedypipe performs the assignment process and obtains an assignment that maximizes the total throughput. The results in this case are that LPM (shared across all flows) is assigned to stage 1 while the encryption and compression components of flows 2 and 3 are assigned to and share the remaining stages. Table 11.4 shows the resulting best overall bandwidth (both in Gbps and packets/second, Pps) for each flow, and the length of each pipeline for the two- and three-pipeline cases. Note that with a single pipeline, the bandwidth is constrained by the longest latency task which, in this case, corresponds to encryption, which has the highest computational complexity.

Table 11.4: Bandwidths for best assignments (Pps = packets/sec; packet lengths: flow 1 = 40 B; flows 2 and 3 = 1500 B).

Number of   Flow 1 rate          Flow 2 rate          Flow 3 rate          Pipe 1   Pipe 2   Pipe 3
pipelines   (Gbps / Pps)         (Gbps / Pps)         (Gbps / Pps)         length   length   length
1           0.018 / 5.7 × 10^4   0.68 / 5.7 × 10^4    0.68 / 5.7 × 10^4    16       0        0
2           7.94 / 2.4 × 10^7    0.68 / 5.7 × 10^4    0.68 / 5.7 × 10^4    4        12       0
3           7.94 / 2.4 × 10^7    0.525 / 4.3 × 10^4   0.48 / 4.0 × 10^4    4        6        6

In the second case, the 16 stages are divided into two pipelines with their lengths not necessarily being equal. Greedypipe was executed iteratively with a different number of stages assigned to each pipeline and, for each pipeline length selection, a near optimal assignment was obtained (see Figure 11.9). For pipeline 1 (Flow 1, LPM), the results are the same as seen in Figure 11.6 since there is a single pipeline associated with that flow. As the number of stages for pipeline 1 increases, however, the number remaining for pipeline 2 decreases. The result is that there are increased delays for flows 2 (LPM and Encryption) and 3 (LPM and LZW). Greedypipe yields the near optimum task allocations for each of these pipeline lengths. Note that for pipeline 2, the bandwidth is dominated by the requirements of encryption. Flow 1 bandwidth is significantly improved since, having its own pipeline, its packets are now not limited by the delays for encryption.

Figure 11.9: Two pipelines: throughput vs. number of stages (X, Y → X stages for Pipe 1 and Y stages for Pipe 2). Curves: Pipeline 1 (LPM), Pipeline 2 (LPM + Enc/LPM + Comp); x-axis: # stages in pipeline 1, pipeline 2, from 1, 15 to 15, 1; y-axis: pipeline throughput (packets/sec).

For the third case, the 16 processor stages are shared across three pipelines with flows 1, 2, and 3 being assigned to pipelines 1, 2, and 3, respectively. Using Greedypipe, all partitionings of the 16 stages across three pipelines were considered and evaluated. The results indicate that assigning four stages to pipeline 1 (flow 1, LPM) achieves the highest throughput. Figure 11.10 plots results associated with pipeline 1’s length equal to four stages. As more of the remaining 12 stages are assigned to flow 2, there are fewer for flow 3. Given the pipelined implementations of encryption and compression, the crossover point on the graph corresponds to the highest throughput result (also see Table 11.4). Because the lengths of pipelines 2 and 3 are constrained, the full effect of pipelining flows 2 and 3 cannot be realized and the maximum throughput for those flows is less than in the two-pipeline case.
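The two units reported in Table 11.4 are related by the packet length, which the following sketch makes explicit. The helper name `gbps` is hypothetical, and a gigabit is taken as 10^9 bits; the inputs are the rounded packet rates and packet lengths stated for the table.

```python
def gbps(packets_per_sec, packet_bytes):
    """Convert a packet rate to a bit rate in Gbps (10**9 bits per second)."""
    return packets_per_sec * packet_bytes * 8 / 1e9

# Single shared pipeline: every flow is held to the encryption-limited packet rate.
flow1_single = gbps(5.7e4, 40)       # 40-byte LPM packets
flow2_single = gbps(5.7e4, 1500)     # 1500-byte packets

# Two pipelines: flow 1 runs at the LPM-limited packet rate on its own pipeline.
flow1_own = gbps(2.4e7, 40)

# Three pipelines: flows 2 and 3 on six stages each.
flow2_three = gbps(4.3e4, 1500)
flow3_three = gbps(4.0e4, 1500)
```

The results agree with the tabulated Gbps figures to within a few percent; the remaining gaps (for example about 7.7 here against 7.94 in the table) presumably reflect rounding of the packet rates. The comparison also shows why flow 1 gains so much from its own pipeline: with 40-byte packets its bit rate is determined almost entirely by the packet rate.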
Figure 11.10: Three pipelines: throughput vs. number of stages for pipelines 2 and 3; pipeline 1 = 4 stages. Curves: Pipeline 1 (LPM), Pipeline 2 (LPM + Comp), Pipeline 3 (LPM + Enc.); x-axis: # stages in pipeline 2, pipeline 3, from 1, 11 to 11, 1; y-axis: pipeline throughput (packets/sec).

11.5 CONCLUSIONS

This chapter has presented Greedypipe, a heuristic approach that quickly performs near optimal task-to-pipeline stage assignments in a multiple flow, multiple pipeline environment. Though obtaining optimal assignments is an NP-hard problem, Greedypipe obtains very good assignments quickly. While Greedypipe can be used in obtaining throughput performance in a variety of pipelining environments, its development was motivated by the growing importance of embedded processor systems that are based on having multiple pipelines on a single chip. This sort of design is used extensively in current network processors.

In addition to presenting the Greedypipe heuristic, the chapter also illustrates how Greedypipe can be employed as a design tool in situations where the effects of pipeline depth, task sharing, and task partitioning are to be explored as part of the design process. This is done for both a generic design case and a case where three flows are present, each flow requiring one or more networking functions (Longest Prefix Match, Encryption, and Compression). The chapter shows how Greedypipe can be used to obtain good assignments in an environment where the total number of processor stages in all the pipelines is constrained (e.g., 16).

Work is continuing on refining Greedypipe. In particular, we are now exploring methods for incorporating into Greedypipe performance issues related to memory constraints and contention, and for using both dynamic programming and statistical optimization techniques (e.g., simulated annealing) to obtain optimal solutions.
ACKNOWLEDGMENTS

This research has been supported in part by National Science Foundation grant CCR0217334.
REFERENCES

[1] IBM Corp., “IBM Power Network Processors,” www-3.ibm.com/chips/products/wired/products/network_processors.html, 2001.
[2] Intel Corp., “Intel IXP 2800 Network Processor,” developer.intel.com/design/network/products/npfamily/ixp2800.htm, 2003.
[3] John Marshall, “Cisco Systems—Toaster2,” Network Processor Design, Vol. 1, by P. Crowley, M. Franklin, H. Hadimioglu, and P. Onufryk, Morgan Kaufmann Publishers, Inc., San Francisco, California, 2003, pp. 235–248.
[4] M. A. Franklin and T. Wolf, “A network processor performance and design model with benchmark parameterization,” Network Processor Design, Vol. 1 [Also in Network Processor Workshop (HPCA-8), Cambridge, MA, February 2002], Morgan Kaufmann Publishers, Inc., San Francisco, California, 2003, pp. 117–139.
[5] M. A. Franklin and T. Wolf, “Power considerations in network processor design,” Network Processor Design, Vol. 2 [Also in Network Processor Workshop (HPCA-9), Anaheim, California, February 2003], Morgan Kaufmann Publishers, Inc., San Francisco, California, 2003, pp. 29–50.
[6] Tilman Wolf and Mark A. Franklin, “Locality-aware predictive scheduling of network processors,” Proceedings 2001 IEEE Inter. Symp. on Performance Analysis of Systems & Software, Arizona, November 2001, pp. 152–159.
[7] T. Wolf and M. A. Franklin, “Design tradeoffs for embedded network processors,” Proceedings of Inter. Conf. on Architecture of Computing Systems (ARCS) (Lecture Notes in Computer Science), Karlsruhe, Germany, April 2002, vol. 2299, pp. 149–164.
[8] P. Chretienne, E. G. Coffman, Jr., J. K. Lenstra, and Z. Liu, Scheduling Theory and Its Applications, John Wiley & Sons, Chichester, England, 1995.
[9] A. S. Jain and S. Meeran, “Deterministic job-shop scheduling: Past, present and future,” European Jrnl. of Operational Research, 1999, vol. 113, pp. 390–434.
[10] M. Schwehm and T. Walter, “Mapping and scheduling by genetic algorithms,” Conference on Algorithms and Hardware for Parallel Processing, 1994, pp. 832–841.
[11] V. Sarkar and J. Hennessy, “Compile-time partitioning and scheduling of parallel programs,” ACM SIGPLAN ’86 Symp. on Compiler Construction, 1986, pp. 17–26.
[12] A. Moestedt and P. Sjodin, “IP address lookup in hardware for high-speed routing,” Hot Interconnects, August 1998, pp. 31–39.
[13] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, “Scalable high-speed prefix matching,” ACM Transactions on Computer Systems, November 2001, vol. 19, pp. 440–482.
[14] “AS1239 BGP Table Data,” bgp.potaroo.net/1239/bgp-active.html, 2003.
[15] “BGP Table Data,” bgp.potaroo.net/, 2003.
[16] “CAIDA: The Cooperative Association for Internet Data Analysis,” www.caida.org.
[17] K. Claffy, G. Miller, and K. Thompson, “The Nature of the Beast: Recent Traffic Measurements from an Internet Backbone,” Technical Report, April 1998.
[18] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for computer system modelling,” IEEE Computer, February 2002, pp. 59–67.
[19] V. Joshi and M. A. Franklin, “The SimplePipe Toolset Manual,” Technical Report 79, Dept. of Computer Science and Engineering, Washington University in St. Louis, Missouri, 2003.
[20] “AES Algorithm Rijndael Information,” csrc.nist.gov/CryptoToolkit/aes/rijndael.
[21] V. Rijmen and J. Daemen, “The block cipher Rijndael,” Proceedings of the Third International Conference on Smart Card Research and Applications, CARDIS ’98, LNCS 1820, 2000, pp. 277–284.
[22] P. Chodowiec, P. Khuon, and K. Gaj, “Fast implementations of secret-key block ciphers using mixed inner- and outer-round pipelining,” ACM SIGDA Inter. Symp. on Field Programmable Gate Arrays (FPGA ’01), Monterey, California, February 2001, pp. 94–102.
[23] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Information Theory, 1978, vol. 24, pp. 530–536.
[24] T. A. Welch, “A technique for high-performance data compression,” Computer, 1984, vol. 17, pp. 8–19.
CHAPTER 12

A Framework for Design Space Exploration of Resource Efficient Network Processing on Multiprocessor SoCs

Matthias Grünewald, Jörg-Christian Niemann, Mario Porrmann, Ulrich Rückert
Heinz Nixdorf Institute and Department of Electrical Engineering, University of Paderborn, Germany
In networking applications, a large number of users utilize different kinds of communication services, resulting in a large number of independent, parallel packet flows between their stationary or mobile devices. Additionally, network protocols are constantly changing, and the need to quickly adapt to these changes has resulted in the development of network processors (NPUs). NPUs contain a set of programmable and application-specific execution units that are connected via an efficient interconnect structure. Most commercially available NPUs are targeted for wired core and edge network applications, hence they are optimized for throughput. Research mainly concentrates on modeling and benchmarking existing commercial NPUs. However, there exist only a few approaches that aid the design of NPUs as a multiprocessor-based system-on-chip (SoC). This chapter presents a network packet-processing framework that focuses on the processor parallelism available and how it can be exploited to improve performance. The framework contains a programming model that allows a parallelizable description of the protocol stack, a system model to describe the hardware architecture of the multiprocessor, and a nonblocking scheduling strategy for executing the implementation on the multiprocessor. The software can be developed independently from the hardware with the help of a software library that contains interface functions to abstract the underlying hardware. The interface functions can also be used to obtain an integration with existing network simulators such as ns-2 [1] to allow a verification of the implementation.
With the use of special profiling support in the library, the resource requirements of the implementation can also be characterized. As the assignment of protocol functions to execution units can be a time-consuming and performance-critical task, the framework also provides a method to find an energy- or delay-optimized assignment in an automated way. Finally, a component-based resource consumption model is provided to assess the quality of the software-hardware combinations. This allows exploration of the design space created by different system parameters. In the System & Circuit Technology research group at the Heinz Nixdorf Institute, the framework is used to aid in the design of a network processor for wired core and edge networks as well as for mobile networks. The GigaNetIC project [2] aims at developing high-speed components for networking applications based on massively parallel architectures. A central part of this project is the design, evaluation, and realization of a parameterizable network-processing unit. The proposed architecture is based on massively parallel processing, enabled by many processors, that form a homogeneous array of processing elements arranged in a hierarchical system topology with a powerful communication infrastructure. Four processing elements are connected via an on-chip bus to a “switch box” (Section 12.2.2), that allows an arbitrary forming of on-chip topologies. Hardware accelerators support the processing elements to achieve a higher throughput and help to reduce energy consumption. Following a top-down approach, network applications are analyzed and partitioned into smaller tasks. The tasks are mapped to dedicated parts of the system, where a parallelizing compiler exploits inherent instruction level parallelism. The hardware has to be optimized for these programming models in several ways. Synchronization primitives for both programming hierarchies have to be provided and memory resources have to be managed carefully. 
Another target application of this design method is the processing of protocols in mobile networks. Based on the theoretical and simulation studies in Refs. [3, 4], a new medium access protocol for mobile ad hoc networks (MANETs) has been developed. Every network node can transmit in several directions in parallel, dividing the space into k sectors. Additionally, the nodes can adjust the transmission power in each sector separately and can receive and transmit simultaneously. Such advanced transmission technologies can reduce energy consumption because of the power-variable transmission. They can also increase throughput and capacity because of the applied space multiplexing technique. However, an increased amount of processing power is required. A project within the Collaborative Research Center 376 aims to design and implement a low-power multiprocessor for mobile ad hoc networking applications. In this chapter, an implementation of the MANET medium access protocol is used as an example to assess the performance and resource consumption of two different hardware architectures (see Figure 12.1) with up to 64 processors. Generally, the PEs are arranged in a grid on the chip. Architecture A connects one processor via one routing node to the on-chip network. Architecture B is based on the GigaNetIC architecture, where four processors are connected to one routing node.
FIGURE 12.1 PE processing grid (hardware architecture A) vs. the four PE cluster architecture proposed by the GigaNetIC project (hardware architecture B). Legend: PE = S-Core, memory, HW accelerators, transceivers (short lines); SB = switch box with memory and transceivers; on-chip network (long lines).
This chapter presents a detailed description of the main parts of this framework. Section 12.1 gives an overview of related work. Section 12.2 describes the models and Section 12.3 explains the scheduling strategy. In Section 12.4, the mapping algorithm is explained. The resource consumption estimation of the obtained mapping is described in Section 12.5. Quantitative and qualitative results of a detailed design space exploration experiment are discussed in Section 12.6. Finally, Section 12.7 concludes the chapter.
12.1 RELATED WORK
The hardware architecture of the multiprocessor uses switch boxes (SBs) to connect the processors in a Network-on-Chip (NoC). SBs are comparable to routers in computer networks and enable a packet-based communication scheme. The usage of NoCs can solve many of the design problems found when implementing SoCs with deep submicron fabrication technologies [5]. NoCs can structure interconnect lines and hence allow a better prediction of their electrical properties, for example, delay and power requirements. The wires can also be reused for communication between different units, hence the number of wires can be reduced and their utilization can be increased. The input and output buffers also decouple the functional units, hence a globally asynchronous, locally synchronous (GALS) design style can be applied, reducing the problem of clock distribution. Since the interconnect topology can be arbitrary, NoCs scale much better than bus topologies. However, there are also new design challenges. Compared to traditional bus-based interconnects, NoCs introduce an increased delay due to the multihop communication. Compared to LANs and WANs, the buffer space inside the routing nodes has to be kept low since on-chip memory has large area requirements.
There already exists a large variety of hardware architectures for NoCs. They differ in the applied connection topology (folded torus [6], fat-tree [7], grid [8, 9, 10], butterfly fat-tree [11]), area requirements, and the communication services provided. However, none of these approaches has yet classified the influence of the specific communication characteristic of the application on the performance of the NoC. In this chapter it is shown how to close this gap.
For mapping network applications to multiprocessor SoCs, there exist only a few comparable approaches. Crowley and Baer have presented a modeling framework for NPUs [12] that uses the Click Modular Router to describe the application. Their framework is able to model devices (e.g., processors) and channels (e.g., memory and I/O busses) of the system.
The framework has been used to analyze the performance of PC-based systems for packet forwarding in terms of throughput and delay. However, it lacks a method to map protocol functions to devices in an automated way. The method in this chapter takes the special design problems of SoCs such as area and power consumption into account. Franklin and Wolf [13, 14] have developed an analytical performance and power consumption model for an NPU that consists of a number of processor clusters where each cluster has its own external memory. The received packet flows are uniformly distributed by a packet demultiplexer. This approach allows a more fine-grained distribution of protocol parts to processors that are equipped with smaller, but faster and less energy-consuming on-chip memory. Thiele et al. [15] describe packet processing with a task graph and use a system model based on network calculus to obtain the delay and buffer usage characteristics of the application. The mapping of the task graph to the system components is found by using evolutionary search techniques.
Their approach has a high degree of generality and can be applied to many different heterogeneous system architectures. The approach in this chapter, in contrast, is predominantly tailored for the design of a flexible hardware architecture where the tradeoffs between different system parameters need to be analyzed. Therefore, a detailed component-based resource consumption estimation model for delay, area, energy, and power is provided. It allows us to find out which component (e.g., processor, communication link, memory) is the bottleneck in order to improve the software and hardware. Finally, Memik and Mangione-Smith present a framework called NEPAL [16] that allows a mapping of tasks to processors during runtime. It frees the designer from partitioning the implementation by hand. The advantage of the static scheduling and mapping approach described in this chapter is that the performance of the system is known in advance while the partitioning is also done automatically.
12.2 MODELING PACKET-PROCESSING SYSTEMS
As mentioned in the beginning of the chapter, the application program and the hardware architecture of the multiprocessor are modeled in this framework. This section describes the two models and gives examples.
12.2.1 Flow Processing Graph
The packet-processing modeling technique is based on a task graph model, as proposed by Thiele et al. [15]. The functionality that the SoC has to provide is described by a flow processing graph FPG = (M, S) (see Figure 12.2). The nodes m ∈ M represent packet methods. The edges s ∈ S represent flow segments. The packet methods generate and process the packets, and the flow segments describe how the packets are forwarded between the methods. Each flow segment has a specific bandwidth requirement $B_s$ as well as a minimum packet size $p_{\min}$ and a maximum packet size $p_{\max}$. An instance l ∈ L of a protocol layer consists of a set of methods $M_l \subset M$ that share a common state Z. The state contains data structures that are shared by the methods of the corresponding layer. It can also be used to store information if the packets of the flow segment are not independent. For example, packet reordering in the transport layer can be implemented by storing a packet queue inside the state that is sorted according to the sequence number of the received packets. Once all packet fragments have been collected, the complete packet can be forwarded to the application layer. A packet flow pf ∈ PF is a list of flow segments that describes
the way of a packet flow through the flow processing graph. The set $S_{pf}$ contains the flow segments that pf consists of.
FIGURE 12.2 Annotated flow processing graph of a medium access protocol for directed mobile ad hoc networks. Legend: flow segments are annotated a1/a2...a3; packet methods (ipp, opp, bpp) are annotated b1(b2); the graph comprises logical ports (PORT 1 to PORT 8), local layers (PHY 1 to PHY 8, Sector-MAC 1 to Sector-MAC 8), and global layers (Neighbor-MAC, NET), each layer with its state.
Each protocol layer can have an arbitrary number of methods. Usually, three methods are used to describe a layer. Two methods are necessary to process packets from inbound flows (that traverse up the protocol stack) and from outbound flows (that traverse down the protocol stack). In this chapter, these methods are denoted by ipp (input packet processing) and opp (output packet processing). These two methods are called when packets are received that have been generated by other packet methods. However, in most networking applications, time-dependent events need to be processed, too. For this purpose, the method bpp (backlog packet processing) is used. It operates on packets that are stored for later processing at user-defined time points. For example, it can be used for retransmitting packets after a time-out or for implementing flow-independent maintenance tasks such as the routing table cleanup. To specify the period of
the bpp methods, virtual flow segments are defined whose minimum packet size and bandwidth determine the calling period of the method. If a packet flow pf is generated by a virtual flow segment, pf also contains the corresponding virtual flow segment. There also exist special layers lp ∈ Llp ⊂ L called logical ports that denote the sources and sinks of the packet flows that are processed by the system. In this chapter, the term port data rate is used to denote the data rate at which the external packet flows enter and leave the system.
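The modeling elements introduced above can be sketched as plain data structures. This is a minimal illustration, not the PPL's actual data model: the class names, field names, and the `min_packet_interval` helper are hypothetical, and the reading that a virtual flow segment triggers its bpp method every $p_{\min}/B_s$ seconds is an assumption derived from the statement that minimum packet size and bandwidth determine the calling period.

```python
from dataclasses import dataclass, field


@dataclass
class FlowSegment:
    """Edge s of the flow processing graph FPG = (M, S)."""
    name: str
    source: str        # packet method (or logical port) that emits the packets
    target: str        # packet method (or logical port) that receives them
    bandwidth: float   # B_s in bits per second
    p_min: int         # minimum packet size in bits
    p_max: int         # maximum packet size in bits
    virtual: bool = False  # True for segments that only trigger a bpp method

    def min_packet_interval(self) -> float:
        """Shortest spacing between packets, p_min / B_s, in seconds."""
        return self.p_min / self.bandwidth


@dataclass
class Layer:
    """Protocol layer instance l: a set of methods M_l sharing one state Z."""
    name: str
    methods: tuple = ("ipp", "opp", "bpp")
    state: dict = field(default_factory=dict)


# Hypothetical virtual segment: 128-bit "packets" at 12.8 kbit/s.
timer = FlowSegment("hello_timer", "NeighborMAC.bpp", "NeighborMAC.bpp",
                    bandwidth=12_800.0, p_min=128, p_max=128, virtual=True)
print(timer.min_packet_interval())
```

Under the stated assumption, the calling period of a bpp method is adjusted simply by changing the bandwidth or the minimum packet size of its virtual flow segment.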
Example 1: A Level-2 Ethernet Switch
Exemplary physical (PHY) and medium access control (MAC) layers for an Ethernet switch could be implemented as follows: The logical port denotes the Ethernet transceiver that receives the packet. The transceiver would forward the packet along a flow segment to the ipp method of the physical layer. The bandwidth of the flow segment would be equal to the data rate of the transceiver, for example, 100 Mbps. The minimum and maximum packet sizes would be set to the values supported by the transceiver. The physical layer would process the packet by executing the ipp method. For example, error detection based on a cyclic redundancy check (CRC) could be performed. The method could then remove any control information used by the transceiver, for example, the frame check sequence used for the CRC computation. It could also store the payload of the packet in a queue in its state and forward only the headers to the next protocol layer. For this purpose, the ipp method has to add a new internal control header to the stripped-down packet to be able to reassemble the payload before the packet leaves the system. The modified packet would then be forwarded to the next higher layer, that is, to the ipp method of the medium access layer. The bandwidth of the flow segment could be lowered since the modified packet contains fewer bits (the frame check sequence has been removed). The medium access layer could then further modify the packet. In an Ethernet switch, the layer has to decide on which port the packet has to be forwarded, according to its destination MAC address. With a table stored in the state of the medium access layer, the MAC layer can relate the MAC addresses to the appropriate destination port. The method would then forward the packet to the opp method of the physical layer that processes the outgoing packet for the destination port.
The opp method would have to request the payload part of the packet from the ipp method of the physical layer that has received the packet. Hence, there would be flow segments between each opp method and all ipp methods of the physical layers. To prevent the opp method from blocking the processor until the
payload is received, the method should store the outgoing packet in a queue in its state. If the payload is received, the opp method is called again. It collects the appropriate packet from the queue in its state, adds the received payload, generates the frame check sequence for the CRC, and forwards it to the logical port that represents the Ethernet transceiver. The disadvantage of this implementation would be that much internal communication would occur. However, NoCs usually provide enough communication bandwidth for this purpose.
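The ipp step of the physical layer in Example 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based layer state, and the 4-byte payload handle used as internal control header are hypothetical, and the frame check sequence is taken to be the CRC-32 of the frame stored in little-endian byte order.

```python
import zlib

HEADER_LEN = 14  # destination MAC (6) + source MAC (6) + EtherType (2) bytes


def phy_ipp(state: dict, frame: bytes):
    """Hypothetical ipp method of the PHY layer from Example 1."""
    body, fcs = frame[:-4], frame[-4:]
    # Error detection: recompute the CRC-32 and compare it with the FCS.
    if zlib.crc32(body).to_bytes(4, "little") != fcs:
        return None  # corrupted frame, nothing is forwarded
    # Keep the payload in the layer state until an opp method asks for it.
    handle = state["next_handle"]
    state["next_handle"] = handle + 1
    state["payloads"][handle] = body[HEADER_LEN:]
    # Internal control header (assumed layout): 4-byte payload handle.
    return handle.to_bytes(4, "big") + body[:HEADER_LEN]


def phy_fetch_payload(state: dict, handle: int) -> bytes:
    """Serve the payload request of an opp method and free the queue entry."""
    return state["payloads"].pop(handle)
```

Only the header plus the small control header travels to the MAC layer, which is why the bandwidth of that flow segment can be lower than the port data rate; the payload crosses the on-chip network once, when the outgoing opp method requests it.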
Example 2: Topology Control for Directed MANETs
An exemplary packet flow graph of an already implemented network application is shown in Figure 12.2. The graph models a novel medium access protocol for mobile ad hoc networks [3, 4]. The graph consists of one local layer for each sector (Sector-MAC) and a global layer (Neighbor-MAC) that maintains a list of neighbors. In each sector, the network node maintains a link only to the best suited neighbor, that is, the one that requires the least transmission power. In a periodically occurring time window, other nodes can send requests to change the link if they are better suited. The local and global layers synchronize if network changes occur, that is, the neighbor list in the global layer is updated accordingly. If other nodes have moved nearer to the node, they can contact it during a periodically occurring time window. The local layer informs the global one if contact packets have been received from other nodes. If one of the new nodes is better suited, the responsible local layer is updated by the global layer with the address of the new communication partner and the appropriate transmission power. The use of a contact time window has the advantage that an adjustable amount of transmitter bandwidth is used for the medium access protocol and that the bandwidth of the flow segments between the local and global layers can be kept low.
In the design space exploration example in this chapter, two packet flows are used for each port (see Figure 12.2). The first one, $pf_A$, is the flow of a MAC control packet that has been received from a neighbor. It is first received from the physical interface of the sector by the local MAC layer. Its processing by the ipp method results in an update packet that is transmitted to the global MAC layer. The ipp method of the global layer generates an answer packet for the neighbor and submits it back to the local layer for transmission.
Since the contact window of the neighbor may not be active at the moment, the packet is put in a packet queue by the opp method. Finally, the bpp method submits the packet back to the physical layer for external transmission if the contact window is reached. The second flow pfB describes the way of packets that are internally generated by the global MAC layer, for example, periodically submitted hello packets to
find other nodes. They are generated by the bpp method of the global layer and forwarded to the opp method of the local layers. The remaining path is the same as for the previous flow. This medium access protocol has been implemented within an ANSI-C-based software environment called the packet processing library (PPL). Within the PPL, the processing can be described according to the concepts outlined in this section. It has low memory requirements; for example, the size of the program code and data for the example application, compiled for our 32-bit RISC processor S-Core [17] used in the exemplary multiprocessor, is below 70 kB. Therefore, it is possible to fit the complete application into the on-chip memory of the system. The PPL uses interface functions to abstract the underlying hardware. Hence, the packet processing can be mapped to arbitrary systems. An interface to the network simulator SAHNE [4] allows hardware-independent software development. The numbers in the exemplary flow graph denote the resource requirements of the implementation. They are explained in Section 12.6.
12.2.2 SoC Architecture
It has been shown so far how the network application is modeled. The next step is to define a model of the system that executes the application. An instance of such a packet processing system is described by an architecture graph SYS = (G, H) whose nodes g ∈ G represent the modules used and whose edges h ∈ H represent the communication links between the modules (see Figure 12.3). The modules can be divided into three sets $G_{PE} \cup G_{SB} \cup G_{PP} = G$. The set $G_{PE}$ contains every processing engine PE of the system. PEs perform the processing of flow segments. They contain a main processor, embedded memory, hardware assists, and a link interface (transceiver) to connect with the SB. The processor inside the PEs executes the packet methods. The hardware assists are used to speed up the execution. For example, a cyclic redundancy checker (CRC) or a random number generator (RNG) are useful hardware assists for networking. The instructions of the packet methods and the states of the layer instances are stored inside the embedded memory. Please note that PEs cannot access external memory via the on-chip network. Hence, all assigned protocol layers have to fit into the on-chip memory of the PE. This excludes some memory-intensive network applications from the field of application of this hardware architecture. However, Section 12.7 presents some ideas on how external memories can be included in this framework. Since caches are not necessary in this hardware architecture, multithreaded processor architectures are also not required to hide external memory latencies. The different components inside the PE are connected as slaves via the main processor bus. An exception is the link interface, which is
a master with a higher bus access priority than the main processor. It uses the internal memory for reading and writing packet data. Packets are received via the downlink from the SB and are submitted via the uplink to the SB. By applying the scheduling strategy explained in the next section, each PE can process several layer instances and each protocol layer instance l can be assigned to a different PE. The set $G_{SB}$ contains the switch boxes SB of the system. Switch boxes in general are the core components for the on-chip communication [8]. They forward the flow segments between PEs. A switch box can contain a crossbar switch, multiplexers, or shared memory to be able to dynamically form connections between pairs of input links and output links. This approach makes use of a dual-port shared memory (see Figure 12.3) that is connected via arbitrated buses to the transceivers. The communication is established via transceivers (TX/RX), which allow data transfers to the on-chip network neighbors via so-called long links. The arriving packets are stored in dedicated buffers in the shared memory.
FIGURE 12.3 A general system instance consists of processing engines (PEs), switch boxes (SBs), and links. Legend: a PE contains the main processor (S-Core), embedded RAM, hardware assists (CRC, RNG, ...), and a TX/RX link interface on the CPU bus; an SB contains a flit buffer, a LUT, logic, and TX/RX transceivers; physical ports (PP) attach PORT 1 to PORT 8; a processing cluster c has width $w_c$ and depth $d_c$.
In addition, the switch box provides an interface to connect with the local PEs. For this purpose, so-called short links are used that employ the same transceiver logic. For scheduling the forwarding of flow segments, a look-up-table (LUT) and the necessary logic have to be integrated into the SB. Finally, the set GPP contains the physical ports PP of the system. Physical ports denote the physical connections used to inject and receive external packet flows (see Figure 12.3). The logical ports from the flow processing graph are later assigned to physical ports. To each physical port, several logical ports can be assigned as long as the links of the port have enough bandwidth to forward the flow segments. Each border PE has only one nearest physical port to ensure a uniform load at the border. The mapping of logical ports to physical ports and the purpose of processing clusters are explained in Section 12.4.
12.3 SCHEDULING
Depending on the mapping of the flow processing graph to the system, several flow segments have to be processed by a single PE and have to be forwarded by a single link. For this purpose, enough computing power has to be assigned to each packet method that processes the flow segment. The share of computing power has to be high enough such that the current packet is processed before the next packet of the flow segment has been received. The same applies for forwarding flow segments between PEs if the next packet method is not assigned to the same PE. Here, the current packet of the flow segment has to be transferred to the target module of the link before the next packet of the flow segment arrives. For coordinating this processing and forwarding, the scheduling algorithm generalized processor sharing (GPS) [18] is used. It allocates for every flow segment s a FIFO buffer¹ and a weight $\phi_s$. The weight describes how much computing power or link bandwidth is necessary for the flow segment. The algorithm guarantees that the flow segments are processed in parallel at an individual service rate $\tilde{B}_s$ of

\[ \tilde{B}_s(t) = \frac{\phi_s}{\sum_{\tilde{s} \in S(t)} \phi_{\tilde{s}}} \, B \tag{12.1} \]

where B is the total service rate of the module or link. The set S(t) contains the back-logged segments at time t, that is, the segments whose input FIFO buffer is not empty. The weight $\phi_s$ has to be chosen such that the computing/bandwidth requirements of the flow segments are met. In the worst case, input packet bursts have to be processed on all flow segments. To prevent packet queues from overflowing in this case, the weights for GPS are chosen such that the system processes the flow segments at their bandwidth requirements. Hence, the throughput is always guaranteed, even if the system works under full load. However, it is important that the flow segments and packet methods have been characterized before with their correct worst-case resource consumption. Sections 12.3.1 and 12.3.2 show how the weights can be calculated. The delay D(s) that each packet of a flow segment observes depends on the currently available service rate $\tilde{B}_s$ and the packet length p (in bits). For the sake of simplicity, it is assumed here that the service rate $\tilde{B}_s$ remains constant during the processing of the packet. The resulting delay is:

\[ D(s) = \frac{p}{\tilde{B}_s(t)} = \frac{p \sum_{\tilde{s} \in S(t)} \phi_{\tilde{s}}}{\phi_s \, B} \tag{12.2} \]

¹ Please note that this packet queue is mandatory for each flow segment. The packet queues that are mentioned in Section 12.2.1 are additional queues used by the implementation.
An example that shows how the processing and forwarding of flow segments works is given in Section 12.3.3.
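Equations (12.1) and (12.2) translate directly into two small functions. The following is a minimal sketch; the function names and the dictionary-based interface are assumptions made for illustration, and the service rate is treated as constant while a packet is served, as in the text.

```python
def gps_service_rate(phi, backlogged, s, total_rate):
    """Equation (12.1): rate granted to segment s among the backlogged ones.

    phi        -- dict mapping segment name to its GPS weight
    backlogged -- names of the segments whose FIFO is not empty (must contain s)
    total_rate -- total service rate B of the module or link
    """
    return phi[s] / sum(phi[x] for x in backlogged) * total_rate


def gps_delay(packet_bits, phi, backlogged, s, total_rate):
    """Equation (12.2): delay of a packet of length p at a constant rate."""
    return packet_bits / gps_service_rate(phi, backlogged, s, total_rate)


# Hypothetical weights: segment "b" alone gets the full rate, but only a
# third of it while "a" is backlogged as well.
phi = {"a": 0.5, "b": 0.25}
print(gps_service_rate(phi, {"b"}, "b", 90.0))
print(gps_service_rate(phi, {"a", "b"}, "b", 90.0))
```

The second call shows the property the chapter relies on: the share of a segment depends only on the weights of the segments that currently have packets queued, so idle segments free capacity for the others.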
12.3.1 Forwarding Flow Segments Between PEs
A practical implementation of GPS is packetized generalized processor sharing (PGPS). PGPS implements GPS by defining finish times for each packet that has been received from the different flow segments. The packet with the earliest finish time is processed next with the full service rate B. The finish time of PGPS is at most the finish time of GPS plus the time needed to process the largest packet at the maximum service rate B [18]. This is because the processing of the current packet can no longer be suspended if a new packet arrives that has an earlier finish time. However, if all packets have the same length and their reception is synchronized on all links, the current packet will be finished when the next packet arrives (if equal data rates are used on all links). Hence, the finish times of PGPS are equal to those of GPS in this case. For every flow segment whose layers are assigned to different modules, a virtual circuit has to be set up through the NoC that supports the bandwidth and delay requirements of the flow segment. The virtual circuit is described by a set $H_s$ of links and a set $G_{SB,s}$ of SBs that are used for forwarding the segment. The packets of the flow segment are divided into data units of fixed size called flits. This is necessary to use the PGPS implementation of GPS without additional delays caused by varying packet sizes (as discussed before). The necessary fragmentation and reassembly has to be done by the link interface in each PE.
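The fragmentation into flits can be illustrated with a short calculation. The sketch below assumes that every flit has a fixed size of q bits of which a header of q_hat bits identifies the flow segment, so that q - q_hat bits of packet data fit into one flit, and that the last flit of a packet is padded; the function names are hypothetical.

```python
def flits_per_packet(packet_bits: int, q: int, q_hat: int) -> int:
    """Number of fixed-size flits needed to carry one packet."""
    payload_bits = q - q_hat
    if payload_bits <= 0:
        raise ValueError("flit header must be smaller than the flit")
    return -(-packet_bits // payload_bits)  # ceiling division


def wire_bits(packet_bits: int, q: int, q_hat: int) -> int:
    """Bits actually transferred on a link, including headers and padding."""
    return flits_per_packet(packet_bits, q, q_hat) * q


# A 1000-bit packet in 128-bit flits with a hypothetical 8-bit header.
print(flits_per_packet(1000, 128, 8), wire_bits(1000, 128, 8))
```

Because all flits have the same length, a link scheduler never has to wait for a long packet of another flow segment to finish, which is the reason given in the text for fragmenting packets in the first place.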
The use of GPS allows the forwarding of several flow segments over the same link with a bounded delay. Compared to the approaches in Refs. [6, 11, 10], GPS scheduling removes the need to use bandwidth reservation or to compute the schedule offline. If no packets are received from one flow segment, a higher service rate $\tilde{B}_s$ is available for the other flow segments [Equation (12.1)]. Hence, they can be forwarded with a reduced delay. For this approach, each flit has a size of q bits and contains a header of size $\hat{q}$ bits that indicates the flow segment to which the flit belongs. It is used inside the SBs to compute the correct finish time. PGPS is used to forward the flits on the links. The weight $\phi_{s,h}$ that segment s requires on link h is set in a way that the delay of the currently received flit is below the reception time of the next flit from the same flow segment. The weight can be computed from the segment bandwidth $B_s$ and link data rate $B_h$ by taking the flit header into account:

\[ \phi_{s,h} = \frac{1}{B_h \left(1 - \frac{\hat{q}}{q}\right) / B_s} \tag{12.3} \]

The delay D(s, h) per flit follows from the GPS delay in Equation (12.2):

\[ D(s,h) \le \frac{q \sum_{\tilde{s} \in S_h} \phi_{\tilde{s},h}}{\phi_{s,h} \cdot B_h} \tag{12.4} \]

where the set $S_h$ contains the flow segments that are forwarded on link h. To ensure that a link is not overloaded, its utilization $U_h$ must not exceed one:

\[ U_h = \sum_{\tilde{s} \in S_h} \phi_{\tilde{s},h} \le 1 \tag{12.5} \]
This guarantees that the flit forwarding time is below the reception time of the next flit of the segment. Hence, each SB has to hold, at most, two flits per flow segment in its buffer. Since the flits can have a length that is only a fraction of the maximum packet size, the buffer requirements can be kept low. This tackles one important problem of NoC design since on-chip memory has high area requirements. However, the drawback of this method is that the reception time points of all links have to be synchronized in each SB.
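Equations (12.3) to (12.5) can be checked numerically. The sketch below is an illustration with hypothetical function names; it uses the uplink of PE 1 from the scheduling example of Figure 12.4, where bandwidths are given in flits per slot, the link rate is one flit per slot, the flit size is one flit, and the flit header is neglected.

```python
def link_weight(b_s, b_h, q, q_hat):
    """Equation (12.3): GPS weight of a segment with bandwidth b_s on a link."""
    return 1.0 / (b_h * (1.0 - q_hat / q) / b_s)


def link_report(segments, b_h, q, q_hat):
    """Weights, utilization (12.5), and per-flit delay bounds (12.4).

    segments -- dict mapping segment name to its bandwidth B_s
    """
    weights = {s: link_weight(b, b_h, q, q_hat) for s, b in segments.items()}
    utilization = sum(weights.values())
    if utilization > 1.0 + 1e-12:
        raise ValueError("link overloaded: U_h = %.3f" % utilization)
    delays = {s: q * utilization / (w * b_h) for s, w in weights.items()}
    return weights, utilization, delays


# Uplink of PE 1 in Figure 12.4: segments D (2/10) and E (1/4) flits per slot.
weights, u_h, delays = link_report({"D": 2 / 10, "E": 1 / 4},
                                   b_h=1.0, q=1.0, q_hat=0.0)
print(weights, u_h, delays)
```

With these inputs the functions reproduce the values annotated in the figure for the uplink: weights of 0.2 and 0.25, a utilization of 0.45, and delay bounds of 2.25 and 1.8 slots.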
12.3.2 Processing Flow Segments in PEs
For applying GPS inside the PEs, some formula symbols have to be substituted since computing power is usually defined by clock frequency and execution cycles. Hence, the service rate B has to be replaced by the frequency $f_{PE}$ of the PE. Additionally, the packet length has to be substituted by the execution cycles $EC_{m_s,PE}$ of the packet method $m_s$ on the PE. We propose to use a real-time scheduling algorithm such as earliest deadline first to implement GPS. It can suspend the execution of the current packet method if a packet arrives that needs to be processed in order to meet an earlier finish time. Here, the application of a multithreading-capable processor inside the PE can be advantageous to minimize task switching times. The weight $\phi_{s,PE}$ that a flow segment s requires to be processed has to be set in such a way that the processing time is below the reception time of the next shortest packet $p_{s,\min}$ of the flow segment. The processing time depends on the frequency $f_{PE}$ of the PE and the execution cycles $EC_{m_s,PE}$ of the packet method on the PE. By knowing the length of the shortest packet and the bandwidth requirement of the flow segment, the necessary weight $\phi_{s,PE}$ for the GPS scheduler can be computed:

\[ \phi_{s,PE} = \frac{EC_{m_s,PE}}{p_{s,\min} \cdot f_{PE} / B_s} \tag{12.6} \]

The execution cycles can vary from one PE to another due to the different hardware assists utilized. For example, an error detection computation with the CRC algorithm can be significantly faster if the PE is equipped with a hardware-implemented CRC assist. The delay D(s, PE) caused by the processing can be computed from Equation (12.2):

\[ D(s,PE) \le \frac{EC_{m_s,PE} \cdot \sum_{\tilde{s} \in S_{PE}} \phi_{\tilde{s},PE}}{\phi_{s,PE} \cdot f_{PE}} \tag{12.7} \]

where $S_{PE}$ is the set of flow segments that are processed by the PE, including virtual flow segments. The PE must not be overloaded to ensure that the delay per packet is below the reception time of the next packet from the flow segment. Therefore, its utilization $U_{PE}$ has to be below 1:

\[ U_{PE} = \sum_{s \in S_{PE}} \phi_{s,PE} \le 1 \tag{12.8} \]
This also limits the buffer space inside the PE to two packets per flow segment: the packet that is processed and the packet that is received next. Exceptions are flow segments that are generated and processed inside the same PE. They require buffer space for only one packet since the PE can either generate or process one packet at the same time.
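The PE-side formulas, Equations (12.6) to (12.8), can be evaluated in the same way as the link formulas. The following sketch uses hypothetical function names and the values annotated for PE 1 in the scheduling example of Figure 12.4: a PE frequency of 10 cycles per slot and the flow segments A, B, and C with 15, 5, and 5 execution cycles.

```python
def pe_weight(ec, p_min, f_pe, b_s):
    """Equation (12.6): GPS weight of a flow segment on a PE."""
    return ec / (p_min * f_pe / b_s)


def pe_report(segments, f_pe):
    """Weights, utilization (12.8), and processing delay bounds (12.7).

    segments -- dict: name -> (execution cycles, p_min, bandwidth B_s)
    """
    weights = {s: pe_weight(ec, p, f_pe, b)
               for s, (ec, p, b) in segments.items()}
    utilization = sum(weights.values())
    if utilization > 1.0 + 1e-12:
        raise ValueError("PE overloaded: U_PE = %.3f" % utilization)
    delays = {s: segments[s][0] * utilization / (w * f_pe)
              for s, w in weights.items()}
    return weights, utilization, delays


# PE 1 in Figure 12.4 (packet sizes in flits, bandwidths in flits per slot).
pe1 = {"A": (15, 5, 5 / 10), "B": (5, 1, 1 / 4), "C": (5, 1, 1 / 10)}
weights, u_pe, delays = pe_report(pe1, f_pe=10.0)
print(weights, u_pe, delays)
```

The results match the annotations of the figure: weights of 0.15, 0.125, and 0.05, a PE utilization of 0.325, and delay bounds of 3.25, 1.3, and 3.25 slots for segments A, B, and C.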
If a flow segment is generated and processed at different PEs, the link interface of the source PE has to convert the packets into flits and vice versa. Therefore, these flows require an additional output buffer space of two packets at the source PE that is used by the link interface. As a packet can consist of several flits, the link interface of the PE and SB must either provide a traffic shaping functionality or provide flow control for each flow segment. Otherwise, the PGPS scheduler would transmit all flits immediately to the SB whenever the uplink is free, and the buffer constraint of two flits per flow segment could no longer be met.
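The buffer rules for a PE stated above can be summarized in a few lines. This is a sketch under assumptions: the function name and the three segment kinds are hypothetical, and each buffer slot is sized for the maximum packet size of its flow segment.

```python
# Packets of buffer space per flow segment, following the rules in the text:
#   "in"    -- received from another PE: packet in service + next packet
#   "local" -- generated and processed inside the same PE: one packet
#   "out"   -- generated here, processed elsewhere: two output packets
PACKETS_PER_KIND = {"in": 2, "local": 1, "out": 2}


def pe_buffer_bits(segments):
    """Total packet buffer of a PE in bits.

    segments -- iterable of (p_max_bits, kind) pairs
    """
    return sum(p_max * PACKETS_PER_KIND[kind] for p_max, kind in segments)


# Hypothetical PE with one inbound, one outbound, and one local segment.
print(pe_buffer_bits([(1344, "in"), (1344, "out"), (416, "local")]))
```

Since PEs cannot access external memory, a figure like this has to fit into the embedded memory together with the program code and the layer states.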
12.3.3 A Scheduling Example
In Figure 12.4, a detailed scheduling example is given. In the example, a local layer instance (Sector-MAC) from the exemplary packet flow graph (see Figure 12.2) is assigned to PE 1. The physical layer (PHY) is assigned to PE 2, that is, its packet methods are executed on PE 2. Flow segment B is generated at PE 2 and forwarded via the uplink of PE 2 to SB 2, over link 3 to SB 1 and via the downlink to PE 1. Flow segment D takes the opposite way. There is also a virtual flow segment C that periodically calls the bpp method of the Sector-MAC layer. Flow segments A and E arrive/depart via the downlink/uplink of PE 1 over link 1 of SB 1. There are three additional flow segments G, F, and H that are forwarded on link 3. The bandwidth requirements, packet sizes, weights for the GPS scheduler, and the worst-case delay are outlined in the last column of the scheduling diagram. Each flit has a size of q = 128 bits. The flit header size is neglected in this example ($\hat{q} = 0$ bits). Packet sizes are given in number of flits and bandwidth is given in flits per time slot. In each time slot, one flit is forwarded, hence the data rate of the links is $B_h = 1$ flit/slot. The frequency of the PE is $f_{PE} = 10$ cycles/slot. The first row of the scheduling diagram shows how flits from flow segments A and B arrive in PE 1. The link interface of PE 1 writes the flits sequentially to the memory of the PE. For each flow segment, a different memory location is used. When a complete packet is received, the GPS scheduler of the PE is informed (e.g., via a processor interrupt). Rows 2–4 show the reception time points and sizes of completely received packets. The real-time GPS scheduler has a list of active flow segments and computes the finish times (deadlines) for each received packet. This also includes the virtual flow segments (here segment C) that do not depend on external packet reception.
The finish time of a packet is equal to its reception time plus its worst-case delay, obtained from Equation (12.7). In the example, the index attached to the packet identifier denotes its finish time. The packet with the earliest finish time is processed next. Row 5 shows how the packets are processed. If a packet arrives with an earlier finish time than the one currently being processed, the current packet is suspended and the new one is processed first. In the example, the processing of the packet from flow segment A is suspended since the packet from flow segment B has an earlier finish time. The precomputed weights [Equation (12.6)] ensure that all packets are processed within their deadlines.

During the processing, new packets are generated. For example, the bpp method generates a new packet on flow segment D. The processor informs the link interface that new packets have to be transmitted. The link interface performs the GPS scheduling on the uplink to SB 1. According to the precomputed weights for the uplink [Equation (12.3)] and the resulting maximum delay [Equation (12.4)], the flow segment with the earliest finish time is chosen. The next packet data is read from the memory and transferred on the uplink until the flit size is reached. To prevent the flit buffer for the current flow segment in the SB from overflowing, the scheduler suspends the forwarding of the active flow segment until the finish time of the current flit is reached. In the example, flow segments E and D are forwarded on the uplink of PE 1 to SB 1. This results in an uplink utilization of 0.45 (45 percent).

Each SB performs the same GPS scheduling in parallel for each of its output links. For each output link, all input links, including the uplink from the PE, have to be considered. In the example, flits from flow segments D, F, G, and H are multiplexed on link 3 from SB 1 to SB 2. Flow segments A and B are not considered for output link 3. Link 3 is completely utilized.

FIGURE 12.4. An example showing the scheduling of the on-chip processing and communication. Each block represents one flit. The associated flow processing graph and architecture graph are shown above the table. [Scheduling diagram not reproduced. Its rows, plotted over the time steps, cover the downlink from SB 1, the input flow segments A, B, and C, the processing on PE 1, the output flow segments D and E, the uplink to SB 1, input links 1 to 3 and output link 3 of SB 1, and the downlink from SB 2. Annotations in the last column for PE 1: $p_{A,\min} = 5$, $B_A = 5/10$, $\phi_A = 0.15$, $D_A \le 3.25$, $EC_{A,PE} = 15$; $p_{B,\min} = 1$, $B_B = 1/4$, $\phi_B = 0.125$, $D_B \le 1.3$, $EC_{B,PE} = 5$; $p_{C,\min} = 1$, $B_C = 1/10$, $\phi_C = 0.05$, $D_C \le 3.25$, $EC_{C,PE} = 5$; $U_{PE} = 0.325$. The uplink of PE 1 is annotated with $U_h = 0.45$ and output link 3 of SB 1 with $U_h = 1$.]
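The preemption rule described above (always run the packet with the earliest finish time, suspending the current one if necessary) can be sketched as a small simulation. This is a minimal illustration and not the authors' scheduler: the function name, the tuple layout, and the arrival times are assumptions. Only the execution cycles and delay bounds of segments A and B are taken from the annotations of Figure 12.4.

```python
import heapq

def edf_schedule(packets, cycles_per_slot=10):
    """Preemptive earliest-finish-time-first processing on one PE.

    packets: list of (name, arrival_slot, exec_cycles, finish_time_slot).
    Returns {name: completion time in slots}.
    """
    pending = sorted(packets, key=lambda p: p[1])
    ready = []   # heap of (finish_time, name, remaining_cycles)
    done = {}
    t = 0        # current time in PE cycles
    i = 0
    while i < len(pending) or ready:
        # Admit every packet that has been completely received by now.
        while i < len(pending) and pending[i][1] * cycles_per_slot <= t:
            name, _, cycles, finish = pending[i]
            heapq.heappush(ready, (finish, name, cycles))
            i += 1
        if not ready:
            t = pending[i][1] * cycles_per_slot  # idle until the next arrival
            continue
        finish, name, remaining = heapq.heappop(ready)
        next_arrival = (pending[i][1] * cycles_per_slot
                        if i < len(pending) else float("inf"))
        # Run until the packet is finished or a new packet may preempt it.
        run = min(remaining, next_arrival - t)
        t += run
        remaining -= run
        if remaining == 0:
            done[name] = t / cycles_per_slot
        else:
            heapq.heappush(ready, (finish, name, remaining))
    return done

# A arrives in slot 0 (15 cycles, bound 3.25); B arrives in slot 1 (5 cycles, bound 1.3).
result = edf_schedule([("A", 0, 15, 0 + 3.25), ("B", 1, 5, 1 + 1.3)])
print(result)
```

In this run, A is suspended after 10 cycles when B arrives with the earlier finish time. B completes at slot 1.5 and A at slot 2.0, both before their finish times.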
12.4 MAPPING THE APPLICATION TO THE SYSTEM

We have already clarified how the protocol stack can be executed on the multiprocessor. We now need to determine which PE should execute each layer instance and which links should forward each flow segment. The goal is to find a mapping of layers to PEs that minimizes a given cost function while meeting the technology constraints. The user can choose to minimize either the energy or the delay per packet. This optimization problem is defined as an integer linear program (ILP). Additionally, a hierarchical approach is used to keep the number of variables small and thus the solution time short.

The approach exploits the structure of the flow processing graph (FPG). In the FPG, there exist local and global layer instances (see Figure 12.2). The local layers are connected to logical ports in a pipelined fashion and are associated with their port. For every logical port $lp$, the user defines a set of local layers $L_{lp,\mathrm{local}}$ that contains the layers working for the port. To exploit locality, the local layers have to be processed close to the physical port to which the logical port is assigned. The global layers are connected to several local layers. The set $L_{\mathrm{global}}$ contains the global layers. As the communication between the local and global layers depends on the assignment of the local layers to PEs, it makes sense to bind global layers to PEs after the local layers have been assigned. Details about this mapping strategy can be found in Ref. [19].

To map the application to the system, the border of the system is first divided into processing clusters $c$ that have a width of $w_c$ and a depth of $d_c$ PEs. The purpose of the processing clusters is to reduce the number of layers and the number of PEs that have to be considered when solving the optimization problem. The partitioning algorithm is outlined in Figure 12.5. The operator $|\cdot|$ denotes the number of elements in a set and the operator $(\cdot)^n$ denotes a tuple (list) of size $n$. The algorithm creates for every cluster a set of PEs $G_{PE,c}$, a set of SBs $G_{SB,c}$, and a set of physical ports $G_{PP,c}$. For each physical port PP, a set of assigned logical ports $L_{PP}$ is also created. In Figure 12.3, each processing cluster has a width of two PEs and a depth of one PE. The figure also shows how the logical ports are assigned to physical ports. In each processing cluster, only the local layers that are related to the physical ports in the cluster are executed. For the communication cost computation, a shortest-path algorithm is used in which the links are weighted with the resource that has to be minimized (energy or delay for forwarding a maximum-sized packet, see Section 12.5). All local layers that work for a logical port are assigned to the processing cluster to which the physical port belongs. In Figure 12.3, the physical and medium access layers 1 and 2 are assigned to processing cluster 1, layers 3 and 4 to processing cluster 2, and so on. To be able to compute communication costs between clusters, the local layers are initially assigned to the physical ports.
Global layers are not assigned to processing clusters yet. Now, for every cluster, the layers that belong to the cluster are assigned to its PEs by solving an integer linear optimization problem, defined in Ref. [19]. After all local layers have been assigned, the global layers (Neighbor-MAC and NET in this example) are mapped. In this step, all PEs of the system and all layers of the packet flow graph are assigned to a single new cluster. As the position of the local layers is now known, the global layers can be assigned to free PEs by solving the same optimization problem. The result of these mapping steps is a set of layers $L_{PE}$ assigned to each PE. From $L_{PE}$, the set of assigned packet methods $M_{PE}$ per PE can also be obtained. In the last step, the connections between PEs and physical ports are routed through the NoC with a minimum-cost path algorithm. For every flow segment $s$, a set of used links $H_s$ is obtained. From this path of links, the set $G_{SB,s}$ of the SBs that are used to forward $s$ can also be created. If the constraints of one of the ILPs cannot be satisfied or there are not enough routing resources left, the mapping fails. Otherwise, it is guaranteed that the system achieves the throughput defined by the port data rates. The whole mapping algorithm is summarized in Figure 12.6.

    partition (FPG, SYS)
    begin
      % Create list of border PEs:
      set an arrow cursor to the upper left PE and orient it to the right;
      do
        create a list l ∈ (G_PE)^dc and add a line of dc PEs to the right of the cursor to it;
        G_PE,bd := G_PE,bd ∪ l;
        if no border PE at the next cursor position then
          change cursor direction by 90°;
        set cursor position to the next border PE that lies along the cursor
          direction and that is not an element of any PE line in G_PE,bd;
      while cursor can be positioned;
      % Assign border PEs and associated SBs to clusters
      c := 0;
      while G_PE,bd <> {} do
        remove the next wc PE lines from G_PE,bd and add the PEs to G_PE,c;
        for all PE ∈ G_PE,c do
          add the SB the PE is connected to to G_SB,c;
        c := c + 1;
      end
      % Assign physical ports to the nearest cluster
      for all PP ∈ G_PP do
        compute minimal cost path from PP to all PEs;
        determine cluster c of the PE that has resulted in the lowest communication costs;
        G_PP,c := G_PP,c ∪ PP;
      end
      % Compute a balanced assignment of logical ports to physical ports
      sort G_PP such that the order of the ports corresponds to a clockwise
        circulation along the border that starts at the upper left physical port;
      w := |G_PP| / |L_lp|; u := 0; v := 1;
      set PP to first element of G_PP;
      for all lp ∈ L_lp do
        L_PP := L_PP ∪ lp;
        u := u + w;
        while u ≥ v do
          set PP to next element of G_PP;
          v := v + 1;
        end
      end
    end

FIGURE 12.5. The partitioning algorithm.
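The last step of the partitioning algorithm in Figure 12.5, the balanced assignment of logical ports to physical ports, can be sketched in a few lines. The function and variable names are illustrative, and exact fractions are used instead of floating-point numbers so that the comparison `u >= v` is not affected by rounding, which is a choice made here and not part of the original algorithm.

```python
from fractions import Fraction

def assign_logical_ports(physical_ports, logical_ports):
    """Spread logical ports evenly over the (already sorted) physical ports."""
    assignment = {pp: [] for pp in physical_ports}
    if not logical_ports:
        return assignment
    w = Fraction(len(physical_ports), len(logical_ports))
    u, v, idx = Fraction(0), 1, 0
    for lp in logical_ports:
        assignment[physical_ports[idx]].append(lp)  # L_PP := L_PP ∪ lp
        u += w
        while u >= v:   # advance to the next physical port
            idx += 1
            v += 1
    return assignment

ports = assign_logical_ports(["PP1", "PP2", "PP3", "PP4"],
                             [f"lp{i}" for i in range(1, 9)])
print(ports)
```

With four physical ports and eight logical ports, each physical port receives two consecutive logical ports, which agrees with the assignment described for Figure 12.3.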
    map (FPG, SYS)
    begin
      partition (FPG, SYS);
      for all local clusters c do
        for all PP ∈ G_PP,c do
          for all lp ∈ L_PP do
            assign all local layers to cluster c: L_c := L_c ∪ L_lp,local;
          end
        end
        {L_PE | PE ∈ G_PE,c} := solve local ILP to map L_c to G_PE,c;
      end
      assign all layers and PEs to a new global cluster ĉ;
      {L_PE | PE ∈ G_PE,ĉ} := solve global ILP to map L_global to G_PE,ĉ;
      create a sorted list T of all flow segments s ∈ S, highest B_s first;
      for all s ∈ T do
        disable for this iteration links that do not have enough bandwidth or buffer space left;
        H_s := compute minimal {energy, delay} path between source and destination PEs of s;
        update utilization U_h of all links and buffer usage Q_SB of all SBs;
      end
      estimate resource consumption of mapping;
    end

FIGURE 12.6. The mapping algorithm.
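The routing loop at the end of the mapping algorithm can be illustrated with a small sketch: flow segments are routed in order of decreasing bandwidth, links without enough residual bandwidth are disabled for the current segment, and a minimum-cost path is computed on the remaining links. Dijkstra's algorithm is used here as one possible minimum-cost path algorithm. The buffer check of the SBs is omitted, and all names, costs, and bandwidths are hypothetical.

```python
import heapq

def shortest_path(links, src, dst, usable):
    """Dijkstra over directed links {(u, v): cost}; returns a list of links or None."""
    adj = {}
    for (u, v), cost in links.items():
        if usable((u, v)):
            adj.setdefault(u, []).append((v, cost))
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]

def route_flow_segments(segments, links, capacity):
    """segments: {name: (src, dst, bandwidth)}; links: {(u, v): cost}."""
    load = {h: 0.0 for h in links}
    routes = {}
    # Highest bandwidth first, as in Figure 12.6.
    for name, (src, dst, bw) in sorted(segments.items(), key=lambda kv: -kv[1][2]):
        path = shortest_path(links, src, dst, lambda h: load[h] + bw <= capacity)
        if path is None:
            raise RuntimeError(f"mapping fails: no route for {name}")
        for h in path:
            load[h] += bw  # update link utilization
        routes[name] = path
    return routes, load

links = {("a", "b"): 1.0, ("a", "c"): 1.0, ("c", "b"): 1.0}
segments = {"s1": ("a", "b", 0.6), "s2": ("a", "b", 0.5)}
routes, load = route_flow_segments(segments, links, capacity=1.0)
print(routes)
```

In this toy network, the direct link is taken by the larger segment `s1`. Since it cannot carry `s2` as well, `s2` is routed over the detour via node `c`.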
12.5 ESTIMATING THE RESOURCE CONSUMPTION

With the algorithms outlined in the previous section, a mapping of the flow-segment processing to PEs and of the forwarding between PEs can be found. We now need to estimate the resource consumption of this mapping to be able to compare it with other system configurations. The resource consumption estimations are based on the modeling techniques presented in Ref. [20] and our previous work on NoCs [8] and processing engines [21]. The accuracy of the energy, power, and area estimations is limited since information from data sheets and first synthesis results is used. However, the coarse approximations allow a comparison of different system configurations in an early design phase, an important aspect of system design.

The delay per module depends on the scheduling algorithm applied inside the module, which we have already classified. For forwarding a packet through the NoC, the number of flits into which the packet has to be split is also included in the delay computation. The area of a link $A_h$ can be estimated from its length $l_h$ and width $w_h$. The area $A_g$ of a module can be estimated by summing up the area of its components. These are memories ($A_{\mathrm{mem}}$), functional units such as the processor and a number of hardware assists ($A_{\mathrm{func}}$), and the link interfaces ($A_{\mathrm{tx/rx}}$). Please note that the area consumption of the interconnect structure within the module is neglected at the moment. Additionally, the switch box SB can contain a switch to connect the incoming links with the outgoing ones. This switch has an area requirement $A_{\mathrm{sw}}(|H_{SB}|)$ that depends on the number of links $|H_{SB}|$.

The average power consumption can be estimated by summing up the power of all modules and links, weighted by their utilization. Please note that the power consumption of idle units is neglected at the moment. For computing the power consumption due to memory accesses from link interfaces, the used proportion $\phi_{\mathrm{mem}}$ of the memory bandwidth has to be computed:

$$\phi_{\mathrm{mem}}(g,h) = \frac{q / w_{g,h,\mathrm{mem}} \cdot T_{g,h,\mathrm{mem}}}{q / B_h} \tag{12.9}$$
where $w_{g,h,\mathrm{mem}}$ is the data width of the memory in bits and $T_{g,h,\mathrm{mem}}$ is the access time of the memory. As a module can have multiple link interfaces that access different memories, we use the index $g,h,\mathrm{mem}$ to denote the memory that is accessed for link $h$ in module $g$. Only the memory accesses are used to estimate the power consumption of the link interface; contributions from control logic are neglected. The total power consumption can be computed by summing over all links:

$$\bar{P}_{\mathrm{tx/rx}}(g) = \sum_{h \in H_g} \phi_{\mathrm{mem}}(g,h) \cdot P_{g,h,\mathrm{mem}} \cdot U_h \tag{12.10}$$

where the set $H_g$ contains the links of module $g$ and $P_{g,h,\mathrm{mem}}$ is the active power consumption of the memory.

The energy consumption is estimated per packet. The energy consumption for a link depends mainly on reading the packet contents out of the memory in the source module, transferring the data on the link, and storing it in the memory of the destination module. Since a packet is split into flits when it is transferred between PEs, the number of memory accesses depends on the necessary number of flits $\#q$:

$$\#q(p,h) = \frac{p}{q - \hat{q}} \tag{12.11}$$

where $p$ is the size of the packet in bits. The energy for reading and writing a packet from/to the memory is estimated by:

$$E_{\mathrm{mem}}(g,h,p) = \#q(p,h) \cdot q / w_{g,h,\mathrm{mem}} \cdot T_{g,h,\mathrm{mem}} \cdot P_{g,h,\mathrm{mem}} \tag{12.12}$$
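To make Equations (12.9), (12.11), and (12.12) concrete, the following sketch evaluates them with the switch box parameters listed in Table 12.2. Rounding the number of flits up to a whole number is an assumption made here, as is the 12000-bit example packet. The function names are illustrative.

```python
import math

Q_BITS = 128          # flit size q in bits (Table 12.2)
Q_HEADER_BITS = 8     # flit header size q-hat in bits (Table 12.2)
B_H = 1.0e9           # link data rate in bit/s (Table 12.2)
W_SB_MEM = 32         # SB memory data width in bits (Table 12.2)
T_SB_MEM = 1.43e-9    # SB memory access time in s (Table 12.2)
P_SB_MEM = 96.91e-3   # SB memory active power in W (Table 12.2)

def phi_mem(q, w_mem, t_mem, b_h):
    """Equation (12.9): share of the memory bandwidth used by one link."""
    return (q / w_mem * t_mem) / (q / b_h)

def num_flits(p_bits, q=Q_BITS, q_hat=Q_HEADER_BITS):
    """Equation (12.11); rounding up to whole flits is an assumption."""
    return math.ceil(p_bits / (q - q_hat))

def e_mem(p_bits, q, w_mem, t_mem, p_mem):
    """Equation (12.12): energy to read or write one packet from/to memory."""
    return num_flits(p_bits, q) * q / w_mem * t_mem * p_mem

share = phi_mem(Q_BITS, W_SB_MEM, T_SB_MEM, B_H)
print(f"per link: {share:.4f}, 5 links: {5 * share:.3f}, 8 links: {8 * share:.3f}")
print(f"flits for a 12000-bit packet: {num_flits(12000)}")
print(f"E_mem: {e_mem(12000, Q_BITS, W_SB_MEM, T_SB_MEM, P_SB_MEM) * 1e9:.1f} nJ")
```

Each link uses about 4.5 percent of the SB memory bandwidth. If every link runs at full rate, five links use about 22 percent and eight links about 36 percent, which agrees with the values quoted in Section 12.6.2.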
The energy for transferring the packet on a link is estimated by:

$$E_h(p) = \#q(p,h) \cdot q \cdot (1/B_h) \cdot P_h \tag{12.13}$$

where $P_h$ is the power consumption for driving the data lines of the link. For the SB, the energy consumption for forwarding a packet is considered:

$$E_{SB}(p,h) = \#q(p,h) \cdot q \cdot (1/B_h) \cdot P_{SB,\mathrm{sw}} \tag{12.14}$$
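As an illustration of Equations (12.13) and (12.14), the sketch below compares the transfer energy on a short and a long link of architecture A, using the link parameters from Table 12.2. The packet of 100 flits is a hypothetical value and the function names are illustrative.

```python
import math

Q_BITS = 128      # flit size q in bits (Table 12.2)
B_H = 1.0e9       # link data rate in bit/s (Table 12.2)
P_SB_SW = 0.0     # switch power per link in W (Table 12.2 lists 0 mW)

def link_power(length_m):
    """P_h from Table 12.2: 24.52 mW plus 6.66 W/m times the link length."""
    return 24.52e-3 + 6.66 * length_m

def link_energy(n_flits, length_m):
    """Equation (12.13): E_h = #q * q * (1 / B_h) * P_h."""
    return n_flits * Q_BITS * (1.0 / B_H) * link_power(length_m)

def sb_energy(n_flits):
    """Equation (12.14): E_SB = #q * q * (1 / B_h) * P_SB,sw."""
    return n_flits * Q_BITS * (1.0 / B_H) * P_SB_SW

n_flits = 100                              # hypothetical packet of 100 flits
e_short = link_energy(n_flits, 0.465e-3)   # short link of architecture A
e_long = link_energy(n_flits, 1.159e-3)    # long link of architecture A
print(f"short link: {e_short * 1e9:.1f} nJ, long link: {e_long * 1e9:.1f} nJ")
print(f"switch: {sb_energy(n_flits) * 1e9:.1f} nJ")
```

The long link costs more energy per packet because of the length-dependent term in $P_h$. With the switch power of 0 mW listed in Table 12.2, the switch itself contributes nothing in this model.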
where $P_{SB,\mathrm{sw}}$ is the power consumption during packet forwarding through the internal switch for one link. The energy consumption due to the schedule computation is neglected.

For the PEs, the energy consumption depends on the packet methods, which can have different characteristics. The number of cycles $\widetilde{EC}_{m,PE}$ to execute the packet method $m$ on the processing engine PE is computed by:

$$\widetilde{EC}_{m,PE} = PC_{m,PE} + MA_{m,PE} \cdot N_{\mathrm{wait},PE} \tag{12.15}$$

where $PC_{m,PE}$ are the clock cycles required for executing the packet method $m$ (without any memory wait cycles), $MA_{m,PE}$ are the clock cycles used to access the internal memory of the PE, and $N_{\mathrm{wait},PE}$ are the wait cycles of the memory in the PE. Since the link interface reads and writes packets from and to the memory during the execution of the packet method, the real execution time has to be adjusted accordingly. The period $T_{\mathrm{mem}}(g,h)$ with which the transceiver unit has to access the memory per link is:

$$T_{\mathrm{mem}}(g,h) = \frac{w_{g,h,\mathrm{mem}}}{B_h} \tag{12.16}$$

Hence, the real execution cycles can be computed by:

$$EC_{m,PE} = \widetilde{EC}_{m,PE} + \sum_{h \in H_{PE}} \frac{\widetilde{EC}_{m,PE}}{f_{PE} \cdot T_{\mathrm{mem}}(PE,h) - (N_{\mathrm{wait},PE} + 1)} \times (N_{\mathrm{wait},PE} + 1) \tag{12.17}$$
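The following sketch evaluates Equations (12.15) to (12.17) with the PE parameters from Table 12.2 (230 MHz clock, 32-bit memory, no wait cycles) and two 1 Gbps links, the uplink and the downlink. The cycle counts of the packet method are hypothetical, and the function name is illustrative.

```python
import math

def execution_cycles(pc, ma, n_wait, f_pe, w_mem, link_rates):
    """Real execution cycles of a packet method, Equations (12.15)-(12.17)."""
    ec_tilde = pc + ma * n_wait                       # Equation (12.15)
    ec = ec_tilde
    for b_h in link_rates:
        t_mem = w_mem / b_h                           # Equation (12.16)
        # Cycles taken by the link interface for this link, Equation (12.17)
        ec += ec_tilde / (f_pe * t_mem - (n_wait + 1)) * (n_wait + 1)
    return ec

ec = execution_cycles(pc=1000, ma=300, n_wait=0,      # hypothetical method
                      f_pe=230e6, w_mem=32, link_rates=[1.0e9, 1.0e9])
print(f"EC = {ec:.1f} cycles")
```

For this hypothetical method, the two link interfaces extend the execution from 1000 to roughly 1314 cycles. The overhead grows with the link data rate, since a higher rate shortens the memory access period of the transceiver.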
where the set $H_{PE}$ contains the uplink and downlink to the SB. The power consumption depends on the switching activity caused by the executed packet method. Thus, a power value $P_{m,PE}$ can be defined for each packet method on the PE:

$$P_{m,PE} = C_{m,PE} \cdot U_{dd}^2 \cdot f_{PE} \tag{12.18}$$

where $C_{m,PE}$ is the effective capacitance switched when method $m$ is executed on the PE and $U_{dd}$ is the operating voltage of the chip. Knowing the power consumption of the memory, the energy consumption of the method can now be estimated:

$$E_{m,PE} = MA_{m,PE} \cdot T_{PE,\mathrm{mem}} \cdot P_{PE,\mathrm{mem}} + PC_{m,PE} \cdot P_{m,PE} \cdot (1/f_{PE}) \tag{12.19}$$
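Equations (12.18) and (12.19) can be evaluated directly with the PE parameters from Table 12.2. In the sketch below, the cycle counts of the packet method are hypothetical and the function names are illustrative.

```python
import math

C_M_PE = 69.4e-12      # switched capacitance in F (Table 12.2)
U_DD = 1.2             # operating voltage in V (Table 12.2)
F_PE = 230e6           # PE clock frequency in Hz (Table 12.2)
T_PE_MEM = 2.22e-9     # PE memory access time in s (Table 12.2)
P_PE_MEM = 179.82e-3   # PE memory power in W (Table 12.2)

def method_power(c_eff=C_M_PE, u_dd=U_DD, f_pe=F_PE):
    """Equation (12.18): P = C * U_dd^2 * f_PE."""
    return c_eff * u_dd ** 2 * f_pe

def method_energy(pc, ma):
    """Equation (12.19): memory access energy plus processing energy."""
    return ma * T_PE_MEM * P_PE_MEM + pc * method_power() * (1.0 / F_PE)

p_m = method_power()
e_m = method_energy(pc=1000, ma=300)   # hypothetical cycle counts
print(f"P_m = {p_m * 1e3:.2f} mW, E_m = {e_m * 1e9:.1f} nJ")
```

With the constant switched capacitance used in the experiments, every packet method draws about 23 mW while it executes. For the hypothetical method, the memory accesses and the processing contribute energies of similar magnitude.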
where $T_{PE,\mathrm{mem}}$ and $P_{PE,\mathrm{mem}}$ are the access time and the power consumption of the memory used inside the PE. Finally, the buffer requirements per module have to be computed. Since the applied scheduling ensures that only two packets or flits per flow segment have to be stored in the module, the buffer size $Q_{\{PE,SB\}}$ depends only on the number of flow segments assigned to the module. Table 12.1 summarizes how the resource consumption is computed for a system instance SYS and the application running on it. The functions PE(s,src) and PE(s,dst) return the PE to which the source method and the destination method, respectively, of flow segment $s$ are assigned. The functions src(h) and dst(h) return the source module and the destination module that link $h$ is connected to.

TABLE 12.1. Resource consumption model for the components of the system and of the flow processing graph.

Link $h$
  Area, average power: $A_h = w_h \cdot l_h$, $\bar{P}_h = U_h \cdot P_h$
  Energy: $E_h(s) = E_h(p_{s,\max}) + E_{\mathrm{mem}}(\mathrm{src}(h), h, p_{s,\max}) + E_{\mathrm{mem}}(\mathrm{dst}(h), h, p_{s,\max})$
  Buffer size, delay: $Q_h(s) = 0$, $D_h(s) = \frac{q}{\phi_{s,h} \cdot B_h} \cdot U_h \cdot \#q(p_{s,\max}, h)$

Switch box SB
  Area: $A_{SB} = A_{SB,\mathrm{mem}} + \frac{1}{2} |H_{SB}| \cdot A_{\mathrm{tx/rx}} + A_{SB,\mathrm{sw}}(|H_{SB}|) + A_{SB,\mathrm{func}}$
  Average power, delay: $\bar{P}_{SB} = \frac{1}{2} |H_{SB}| \cdot P_{SB,\mathrm{sw}} + \bar{P}_{\mathrm{tx/rx}}(SB)$, $D_{SB}(s) = 0$
  Energy, buffer size: $E_{SB}(s,h) = E_{SB}(p_{s,\max}, h)$, $Q_{SB}(h) = 2 \cdot q$

Processing engine PE
  Area: $A_{PE} = A_{PE,\mathrm{mem}} + A_{PE,\mathrm{func}} + A_{\mathrm{tx/rx}}$
  Average power: $\bar{P}_{PE} = \bar{P}_{\mathrm{tx/rx}}(PE) + \max_{m \in M_{PE}} \left( P_{m,PE} + P_{PE,\mathrm{mem}} \cdot \frac{MA_{m,PE} \cdot T_{PE,\mathrm{mem}}}{EC_{m,PE} / f_{PE}} \right)$
  Energy, delay, buffer size: $E_{PE}(s) = E_{m_s,PE}$, $D_{PE}(s) = \frac{EC_{m,PE}}{\phi_{s,PE} \cdot f_{PE}} \cdot U_{PE}$, $Q_{PE}(s) = 2 \cdot p_{s,\max}$

Flow segment $s$
  Energy, delay: $\{E, D\}(s) = \{E, D\}_{PE(s,\mathrm{dst})}(s) + \sum_{SB \in G_{SB,s}} \{E, D\}_{SB}(s) + \sum_{h \in H_s} \{E, D\}_h(s)$
  Buffer size: $Q(s) = \begin{cases} Q_{PE(s,\mathrm{src})}(s) + \sum_{SB \in G_{SB,s}} Q_{SB}(s) + Q_{PE(s,\mathrm{dst})}(s) & \text{if } G_{SB,s} \ne \{\} \\ (1/2) \, Q_{PE(s,\mathrm{dst})}(s) & \text{else} \end{cases}$

Packet flow $pf$
  Energy, delay, buffer size: $\{E, D, Q\}(pf) = \sum_{s \in S_{pf}} \{E, D, Q\}(s)$

System SYS
  Area, average power: $\{A, \bar{P}\}_{SYS} = \sum_{g \in G_{SYS}} \{A, \bar{P}\}_g + \sum_{h \in H_{SYS}} \{A, \bar{P}\}_h$
  Energy, buffer size: $\{E, Q\}_{SYS} = \sum_{s \in S} \{E, Q\}(s)$
  Delay: $D_{SYS} = \max_{pf \in PF} D(pf)$
12.6 A DESIGN SPACE EXPLORATION EXAMPLE

The described design methods have resulted in a software tool called Network Application Mapper (NetAMap). It can map the characterized packet flow graph to multiprocessors that use different NoC architectures. The tool uses the data types and algorithms library LEDA [22] to implement the introduced models and methods. To demonstrate the feasibility of this approach, we have analyzed the two architecture variants outlined in Figure 12.1. We first describe the applied system parameters and their origins. Then we present the results of the design space exploration.
12.6.1 Application and System Parameters

The resource consumption of the implementation was measured in the profiling environment PERFMON [21] with the profiling support available in the packet processing library PPL. The network simulator SAHNE [4] was used for generating the traffic. The received packet flows from one node in the simulation were handed over to a PE (equipped with the S-Core processor) that was simulated in a VHDL simulator. The PE executed all packet methods and handed the output flow segments back to SAHNE. With PERFMON, the execution cycles and memory accesses of the packet methods were measured. Please note that only the maximum execution cycles obtained from this execution trace were used. For characterizing the packet methods, the worst-case execution cycles are normally necessary since we want to analyze the worst-case performance of our system. As no tool was available to determine the worst-case execution cycles, only the results from the execution trace could be used. The packet sizes and bandwidth requirements of the flow segments were measured with the PPL profiler. The annotated flow graph in Figure 12.2 shows the results. The numbers at the flow segments and packet methods can be converted as follows (see the legend of the figure for the definition of the variables):

$$B_s = \alpha_1 B_{lp}, \quad p_{\min} = 8\alpha_2, \quad p_{\max} = 8\alpha_3, \quad EC_m = \beta_1, \quad MA_m = \beta_2 \cdot EC_m \tag{12.20}$$
Please note that this approach can in general handle varying execution times for the same method on different PEs, caused by their hardware assists. However, for these experiments, all PEs have the same capabilities; hence, the execution time is equal on all PEs. We also use a constant switched capacitance $C_{m,PE}$ (see Table 12.2) for all methods to simplify the power computation. The packet sizes in Figure 12.2 are given in bytes; hence, they are converted into bits in Equation (12.20). The annotated flow segments in Figure 12.2 have varying bandwidth requirements since internal control information is added to and/or removed from the packets in each layer. For example, the logical port (external receiver) includes status information such as the reception signal strength in the received packets. Hence, the bandwidth requirement of the flow segment between the logical port and the ipp method of the physical layer is 1.4 times higher than the port data rate. This factor can be computed by relating the size of the new packet to the size of the received packet. Since the transfer time of the new packet has to be equal to the transfer time of the old one, the bandwidth of the outgoing flow segment has to be increased if the packet becomes larger, or decreased if it becomes smaller. The scaling factor is obtained by dividing the size of the modified packet by the size of the corresponding original packet. The bandwidth requirement of the flow segment to the ipp method of the network layer is decreased since the medium access layer removes the additional information and its own header from the packet when it forwards a received data packet to the network layer.
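The bandwidth scaling rule and the conversion of Equation (12.20) amount to a few multiplications, as the sketch below shows. The packet sizes of 50 and 70 bytes and the annotation values passed to `flow_segment_params` are hypothetical and only chosen so that the scaling factor equals the 1.4 mentioned in the text. The function names are illustrative.

```python
import math

def scaled_bandwidth(b_in, size_in_bytes, size_out_bytes):
    """Scale the bandwidth by the ratio of modified to original packet size."""
    return b_in * size_out_bytes / size_in_bytes

def flow_segment_params(alpha1, alpha2, alpha3, b_lp):
    """Equation (12.20): bandwidth from the port data rate, sizes from bytes to bits."""
    return {"B_s": alpha1 * b_lp, "p_min": 8 * alpha2, "p_max": 8 * alpha3}

# Hypothetical: 20 bytes of status information are added to a 50-byte packet.
b_out = scaled_bandwidth(3.5, size_in_bytes=50, size_out_bytes=70)
params = flow_segment_params(alpha1=1.4, alpha2=70, alpha3=70, b_lp=3.5)
print(f"{b_out:.2f} Mb/s", params)
```

Both computations give 4.9 Mb/s for a port data rate of 3.5 Mb/s, since the annotation $\alpha_1$ is exactly this size ratio.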
TABLE 12.2. Component parameters for the flow processing graph and the system.

Component: Parameter values

System SYS: $|L_{lp}| = 8$, $B_{lp} = [0.2 : 0.55 : 4.05]$ Mbps, $|G_{SB,S}| = 0$, $|G_{SB,A}| = |G_{PE,A}|$, $|G_{SB,B}| = |G_{PE,B}|/4$, $|G_{PE,S}| = \{1\}$, $|G_{PE,A}| = \{4, 9, 16, 25, 36, 49, 64\}$, $|G_{PE,B}| = \{4, 16, 36, 64\}$, $|G_{PP,S}| = 1$, $|G_{PP,\{A,B\}}| = 4(|G_{SB,\{A,B\}}| - 1)$, $U_{dd} = 1.2$ V

Processing engine PE: $f_{PE} = 230.000$ MHz, $T_{PE,\mathrm{mem}} = 2.22$ ns, $N_{\mathrm{wait},PE} = 0$, $w_{PE,\mathrm{mem}} = 32$ bit, $A_{PE,\mathrm{func}} = 0.210$ mm², $A_{PE,\mathrm{mem}} = 1.10$ mm² (128 kB), $C_{m,PE} \equiv 69.4$ pF, $P_{PE,\mathrm{mem}} = 179.82$ mW

Switch box SB: $Q_{SB,\max} = 2$ KB, $q = 128$ bit, $\hat{q} = 8$ bit, $w_{SB,\mathrm{mem}} = 32$ bit, $T_{SB,\mathrm{mem}} = 1.43$ ns, $A_{SB,\mathrm{mem}} = 0.053$ mm², $A_{SB,\{\mathrm{sw},\mathrm{func}\}} = 0$ mm², $P_{SB,\mathrm{mem}} = 96.91$ mW, $P_{SB,\{\mathrm{sw},\mathrm{func}\}} = 0$ mW

Link $h$: $w_{h,\{S,A\}} = 0.232$ mm, $w_{h,B} = 0.280$ mm, $l_{h,\mathrm{sh},A} = l_{h,S} = 0.465$ mm, $l_{h,\mathrm{lg},A} = l_{h,PP,B} = 1.159$ mm, $l_{h,\mathrm{sh},B} = 0.561$ mm, $l_{h,\mathrm{lg},B} = 2.318$ mm, $A_{\mathrm{tx/rx}} = 32734.72$ µm², $B_h = 1.00$ Gbps, $P_h = 24.52$ mW $+\ 6.66$ W/m $\cdot\ l_h$
Since the control packets of the medium access layer are exchanged only during the contact time window of the nodes, the bandwidth requirements of the flow segments between the local medium access layers and the global one are low. In this example, 10 percent of the port data rate was reserved for these control packets. There are also feedback flow segments from the PHY opp method to the Sector-MAC opp method and from the Sector-MAC bpp method to the NET opp method. They are used to inform the higher layer when a packet has been transmitted (flow control for outgoing packets).

Table 12.2 lists the parameters of the systems $SYS_{\{S,A,B\}}$ and of the system components used in the exemplary design space exploration. $SYS_S$ is a system with a single processing engine, one physical link, and no switch box. $SYS_{\{A,B\}}$ are the systems explained in Figure 12.1. $|G_{PE,\{S,A,B\}}|$ denotes the number of processing elements of the system. The given number of PEs determines the number of switch boxes $|G_{SB,\{S,A,B\}}|$, which in turn determines the number of physical ports $|G_{PP,\{S,A,B\}}|$. The power consumption values and area specifications are based on synthesis results evaluated with the Synopsys Design Analyzer for a 1.2 V/130 nm UMC standard cell CMOS technology [23]. The estimations for the memory area $A_{\{PE,SB\},\mathrm{mem}}$ and the memory power consumption $P_{\{PE,SB\},\mathrm{mem}}$ as well as the access times $T_{\{PE,SB\},\mathrm{mem}}$ were collected from data sheets: MoSys 1T-SRAM-Q [24] for the PEs and Virtual Silicon [25] DP-SRAM for storing the flits in the SBs. The processing engines consist of functional units (CPU, 128-bit transceiver, hardware accelerators) and 128 kB of memory. For the memory, the particularly area-saving one-transistor SRAM technology from MoSys has been used. It needs significantly less area and only 25 percent of the power consumption of a normal SRAM. The area and power for the functional units of the PEs are based on [21].
The switch boxes consist of functional units and a dual-port SRAM of size $Q_{SB,\max} = 2$ kB for storing the flits. As they are not implemented yet, the LUT and the corresponding logic are not considered in this analysis. Because of the smaller number of transceivers, the switch boxes of system A are smaller than the switch boxes of B. Due to the smaller number of switch boxes in B, system B has a relative area advantage in comparison to A. The values of the on-chip interconnects $h_{\{\mathrm{sh},\mathrm{lg},PP\}}$ have been calculated on the basis of discrete wire models from UMC. Short links $h_{\mathrm{sh}}$ connect the PEs to the switch boxes, while long links $h_{\mathrm{lg}}$ interconnect the switch boxes (see Figure 12.1). Physical port links $h_{PP}$ connect the switch boxes with the physical ports. The long links are significantly longer than the short ones and thus represent a higher capacitive and resistive load, which leads to a higher power consumption during communication via these interconnects. The physical port links and the long links are routed on the eighth metal layer. In this model, the long links run along channels that require an area of $w_{h,\{A,B\}} \cdot l_{h,\mathrm{lg},\{A,B\}}$. This area is not used for other functional units of the system, which may be seen as a worst-case estimation. In contrast, the short links are routed on the seventh metal layer. They run diagonally from the switch boxes to the PEs. The actual transceiver logic tx/rx consists of UMC balanced buffer standard cells and flip-flops.
12.6.2 Results

Architectures A and B differ in the total number of links (A has more links than B), the resulting number of SBs (B has a quarter as many SBs as A), and the number of physical ports (B has fewer ports than A). Since a shared-memory SB is used, the memory bandwidth limits the data rate that the SB can handle per link. For the applied Virtual Silicon memory inside the SB, 22 percent of the available transfer bandwidth to the memory is used for five links and 36 percent for eight links. This means that the link data rate can be increased even further. However, since memory bandwidth is reserved inside the PE for storing and reading the packets [Equation (12.17)], a higher link data rate results in reduced computational power inside the PE. For the applied link data rate of 1 Gb/s, 27 percent of the memory bandwidth of the PE has to be reserved for the link interface.

For the experiments, the number of PEs for the two system variants and the port data rate (see Table 12.2) were varied. Figure 12.7 shows the system configurations that are able to process the packets at the given port data rate. Starting with a data rate of 4.05 Mb/s, the computational power of a single PE is not high enough to execute a single local MAC layer; hence, the packet flow graph cannot be executed anymore. System B with 16 PEs is an example in which the optimization strategy does not meet its goals. It has enough computational power to execute the flow processing graph. However, the delay optimization fails. This is due to the fact that the optimization is only optimal per cluster. During the delay optimization, the PEs in the cluster are uniformly utilized to achieve the lowest delay. However, when the global layers are assigned to PEs, all PEs are already utilized and do not have enough computational power left to execute the global layers. Hence, the global layers cannot be assigned and the mapping fails.
For the energy optimization, as few PEs as possible are used to minimize the internal communication. Hence, the PEs are either not loaded or highly loaded. When the global layers are mapped, they have to be mapped to free PEs since the others are highly loaded. Since the free PEs do not execute any other layer, they have enough computational power left to process the global layer. This suggests that for delay-optimization, the global
271
12
|GPE |
272
A Framework for Design Space Exploration
64
{A,B}{D,E}
{A,B}{D,E}
{A,D}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
49
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
36
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
25
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
16
{A,B}{D,E}
{A,B}{D;E}
{A,B}{D,E}
{A,B}{D,E}
{A,B}{D,E}
BE
BE
9
A{D,E}
A{D,E}
A{D,E}
A{D,E}
A{D,E}
4
{A,B}{D,E}
{A,B}{D,E}
1
S{D,E} 0.2
Throughput achieved Throughput not achieved 0.75
1.3
1.85
2.4
2.95
3.5
4.05
Blp (Mbps)
12.7 FIGURE
Explored parameter space. The ID strings denote the architecture (A or B) and the optimization strategy (Energy or Delay). The grey boxes show systems that are able to process the packet flow graph at the given throughput Blp per port. The brackets indicate that both systems/both optimization strategies could be successfully used.
layers should be mapped before the local ones. For similar reasons, the energyoptimization for system A with 16 PEs fails for higher port data rates. Here, the arrangement of the ports in A enforce the energy optimization to uniformly load the PEs, leaving not enough computational power to execute the global layers. The average power consumption is below 240 mW for all examined systems. Hence, such a multiprocessor can be used for our mobile applications. The energy for communication between PEs has a proportion below 19 percent of the total energy consumption per packet, even for the largest system with 64 PEs that contains the longest paths. The delay-optimized system A with 64 processors has the worst-case buffer usage if a port data rate of 2.95 Mb/s is used. A total of 15 KB of the 128-KB buffer space provided by the SBs is used, compared to a buffer usage of 31 KB inside the PEs. The links are the less utilized components in the system; an average utilization of only 1.1 percent is reached for the energy-optimized 16 PE B system at a port data rate of 3.5 Mb/s. The largest systems with 64 PEs require a die size below 185 mm2 , of which nearly half is occupied with links. This suggests that the designer should also try to use the area underneath the links for logic to achieve a better area utilization. The critical protocol layers are the local Sector-MAC layers. A single layer completely utilizes one PE for Blp = 3.5 MB. The global Neighbor-MAC layer requires 76 percent
12.6
A Design Space Exploration Example
273
of the PE’s computational power. This shows that the framework can also be used to find the hot spots of the system and to guide the improvement of the implementation. Pareto-optimal system configurations for two exemplary port data rates are shown in Figures 12.8 and 12.9. For Blp = 3.5 Mb/s, the delay-optimized system B with 64 PEs achieves the lowest delay. Most of the 64 PEs are not used, since the flow graph contains only 18 layers. However, compared to the 36 PE configuration, more links are available to forward the flow segments between the PEs. Since the delay optimization is used, the routing algorithm tries to minimize the load on the links by utilizing all links uniformly, resulting in less delay compared to the 36 PE system. The lowest energy and area are required for the energy-optimized system B with 16 PEs, but at the cost of the highest delay. The energy- and delay-optimized systems B with 36 PEs and the energy-optimized system A with 25 PEs have nearly the same resource consumptions. A achieves the lowest area and delay at the highest energy consumption, compared to the other two systems. The delay-optimized system B has a lower energy consumption at the cost of a slightly increased delay compared to A. Energy-optimized
FIGURE 12.8: Resource consumption of explored system configurations for Blp = 3.5 Mb/s, plotted over delay D (ms), area A (mm²), and energy E (mWs); the legend distinguishes Pareto-optimal from non-Pareto-optimal configurations. The ID strings denote the architecture (A or B), the number of PEs |GPE|, and the optimization strategy (Energy or Delay).
FIGURE 12.9: Resource consumption of explored system configurations for Blp = 1.3 Mb/s, with the same axes and legend.
system B achieves the lowest energy consumption among the three systems at the cost of the highest delay. For the Blp = 1.3 Mb/s data rate, it is interesting to observe that the Pareto-optimal system A always consumes more energy and always produces a slightly higher delay than B with the next higher number of processors, for example, A with 49 PEs compared to B with 64 PEs. The higher energy consumption arises because the paths for forwarding flow segments in architecture A contain more SBs than in B, so more energy-consuming memory accesses have to be performed. However, architecture A should have a delay advantage since it contains more links. The routing algorithm makes use of these additional links to reduce the average link utilization. Since the delay per link is proportional to the link utilization [Equation (12.4)], lower delays are associated with smaller link loads. However, the delay is also proportional to the number of SBs in the path. The paths in A are longer, hence there is a tradeoff. In this example, the link utilization is very low, so the path lengths dominate the delay. Therefore, architecture A has only an area advantage over B. However, if the link utilization is higher, the situation can change. This shows that this design method can be used to study the dependencies between the various application and system parameters.
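The notion of Pareto optimality used to classify the configurations in Figures 12.8 and 12.9 can be illustrated with a short sketch. It assumes that each configuration is described by a (delay, area, energy) triple and that all three metrics are to be minimized. The function names, the configuration labels, and the numbers are hypothetical and are not taken from the figures.

```python
from typing import Dict, List, Tuple

# (delay, area, energy); every metric is to be minimized.
Cost = Tuple[float, float, float]


def dominates(a: Cost, b: Cost) -> bool:
    """a dominates b if it is no worse in every metric and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(configs: Dict[str, Cost]) -> List[str]:
    """Return the names of all configurations that no other configuration dominates."""
    front = []
    for name, cost in configs.items():
        dominated = any(
            dominates(other_cost, cost)
            for other_name, other_cost in configs.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical values for illustration only (not read from Figures 12.8 or 12.9).
example = {
    "X16E": (1.90, 40.0, 90.0),
    "X36D": (1.82, 90.0, 95.0),
    "X36E": (1.85, 95.0, 96.0),  # worse than X36D in all three metrics
    "X64D": (1.78, 180.0, 97.0),
}

if __name__ == "__main__":
    print(pareto_front(example))
```

In this made-up data set, X36E is dominated by X36D and drops out, while the other three configurations each win in at least one metric and therefore stay on the front.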
12.7
CONCLUSIONS We have described a framework for implementing network protocols on multiprocessor SoCs. We have shown how to model the network protocol and the hardware architecture, how to schedule the execution to guarantee a user-defined throughput per port, how to assign the protocol functions to processors, and how to estimate the resource consumption of the final mapping with a component-based estimation model. During the mapping, either delay or energy consumption per packet can be minimized. The methods can be applied in an early design phase and can be used to explore the design space created by different system parameters. A detailed exploration example has been presented to show how a medium access protocol for mobile ad hoc networks can be mapped to a multiprocessor SoC with up to 64 processors. The analysis has shown that such a system can achieve reasonable data rates for the examined application at a low power consumption. This methodology can also be used to quickly identify the bottlenecks of the system such as protocol functions whose computational complexity prevents a higher throughput. The framework cannot yet handle external memories. A possible way to include them is to add memory interfaces to border PEs and to form groups of border PEs that share one memory channel. The external memory can be used to store large data structures such as packet queues or filter tables. By knowing the access pattern of the packet methods and the contention resolution mechanism on the shared memory channel, the worst-case memory access times can be determined analytically. Hence, the influence of the external memory accesses on the execution time of the packet methods can be modeled and included in the scheduling scheme. In our future work, we want to implement the analyzed systems to prove the described concepts and to enhance our framework by adding more details, for example, external memories or overhead for task switching. 
We will also analyze those protocols for wired networks that are used in the network access domain.
ACKNOWLEDGMENTS This work was funded by the Deutsche Forschungsgemeinschaft (German Research Foundation) within the scope of the Collaborative Research Center 376 “Massive Parallelität: Algorithmen, Entwurfsmethoden, Anwendungen.” The work was also supported by Infineon Technologies AG, especially the Corporate Research Systems Technology department (CPR ST, Professor Ramacher). The GigaNetIC project outlined in this report was funded by the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung), registered
there under 01M3062A. The authors of this publication are fully responsible for its contents.
REFERENCES [1]
“The Network Simulator—ns-2,” www.isi.edu/nsnam/ns.
[2]
O. Bonorden, N. Brüls, D. K. Le, U. Kastens, F. Meyer auf der Heide, J.-C. Niemann, M. Porrmann, U. Rückert, A. Slowik, and M. Thies, “A holistic methodology for network processor design,” Proceedings of the Workshop on High-Speed Local Networks held in conjunction with the 28th Annual IEEE Conference on Local Computer Networks, October 20–24, 2003, pp. 583–592.
[3]
M. Grünewald, T. Lukovszki, C. Schindelhauer, and K. Volbert, “Distributed maintenance of resource efficient wireless network topologies,” Proceedings of the European Conference on Parallel Computing (Euro-Par), Paderborn, Germany, August 27–30, 2002, pp. 935–946.
[4]
S. Rührup, C. Schindelhauer, K. Volbert, M. Grünewald, “Performance of distributed algorithms for topology control in wireless networks,” Proceedings of the International Parallel and Distributed Processing Symposium, Nice, France, April 22–26, 2003.
[5]
A. Rădulescu and K. Goossens, “Communication Services for Networks on Chip,” in Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation, editors: S. Bhattacharyya, E. Deprettere, and J. Teich, Marcel Dekker, pp. 275–299, 2003.
[6]
W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnection networks,” Proceedings of the Design Automation Conference, Las Vegas, Nevada, June 18–22, 2001, pp. 684–689.
[7]
P. Guerrier and A. Greiner, “A generic architecture for on-chip packet-switched interconnections,” Proceedings of Design, Automation and Test in Europe, 2000, pp. 250–256.
[8]
A. Brinkmann, J.-C. Niemann, I. Hehemann, D. Langen, M. Porrmann, and U. Rückert, “On-chip interconnects for next generation system-on-chips,” Proceedings of the 15th Annual IEEE International ASIC/SOC Conference, September 2002, pp. 211–215.
[9]
S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2002, pp. 117–124.
[10]
E. Rijpkema, K. Goossens, A. Rădulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander, “Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip,” Proceedings of Design, Automation and Test in Europe, 2003, pp. 350–355.
[11]
P. P. Pande, C. Grecu, A. Ivanov, and R. Saleh, “High-throughput switch-based interconnect for future SoCs,” Proceedings of the 3rd IEEE International Workshop on System-On-Chip for Real-Time Applications, Calgary, Alberta, Canada, June 30–July 2, 2003, pp. 304–310.
[12]
P. Crowley and J.-L. Baer, “A Modeling Framework for Network Processor Systems,” Network Processor Design: Issues and Practices, volume 1, Morgan Kaufmann Publishers, September 2002, pp. 167–188.
[13]
M. A. Franklin and T. Wolf, “A Network Processor Performance and Design Model with Benchmark Parameterization,” Network Processor Design: Issues & Practices, volume 1, Morgan Kaufmann Publishers, September 2002, pp. 117–139.
[14]
M. A. Franklin and T. Wolf, “Power Considerations in Network Processor Design,” Network Processor Design: Issues and Practices, volume 2, Morgan Kaufmann Publishers, September 2003, chapter 3.
[15]
L. Thiele, S. Chakraborty, M. Gries, and S. Künzli, “Design Space Exploration of Network Processor Architectures,” Network Processor Design: Issues and Practices, volume 1, Morgan Kaufmann Publishers, September 2002, pp. 55–89.
[16]
G. Memik and W. H. Mangione-Smith, “NEPAL: A Framework for Efficiently Structuring Applications for Network Processors,” Network Processor Design: Issues and Practices, volume 2, Morgan Kaufmann Publishers, September 2003, chapter 10.
[17]
D. Langen, J.-C. Niemann, M. Porrmann, H. Kalte, and U. Rückert, “Implementation of a RISC processor core for SoC designs FPGA prototype vs. ASIC implementation,” Proceedings of the IEEE-Workshop: Heterogeneous Reconfigurable Systems on Chip (SoC), Hamburg, Germany, 2002.
[18]
A. K. Parekh and R. G. Gallager, “A generalized processor sharing approach to flow control in integrated service networks: The single node case,” IEEE / ACM Transactions on Networking, June 1993, vol. 1–3, pp. 344–357.
[19]
M. Grünewald, J.-C. Niemann, M. Porrmann, and U. Rückert, “A mapping strategy for resource-efficient network processing on multiprocessor SoCs,” Proceedings of Design, Automation and Test in Europe, CNIT La Défense, Paris, France, February 16–20, 2004, pp. 758–763.
[20]
T. Šimunić, L. Benini, and G. De Micheli, “Energy-efficient design of battery-powered embedded systems,” Special Issue of IEEE Transactions on VLSI, 2001, vol. 9, pp. 15–28.
[21]
M. Grünewald, J.-C. Niemann, and U. Rückert, “A performance evaluation method for optimizing embedded applications,” Proceedings of the 3rd IEEE International Workshop on System-On-Chip for Real-Time Applications, Calgary, Alberta, Canada, June 30–July 2, 2003, pp. 10–15.
[22]
K. Mehlhorn and S. Näher, The LEDA Platform of Combinatorial and Geometric Computing, Cambridge University Press, 1999.
[23]
UMC, eSilicon™ Embedded Components, eSi-Route/9™ Standard Cell Library, UMCL13U21T3, Rev. 2.1, October 2001.
[24]
Mark-Erik Jones, Wingyu Leung, and Fu-Chieh Hsu, The Ideal SoC Memory: 1T-SRAM™ , MoSys, Inc., 2002.
[25]
Virtual Silicon, Two-Port SRAM Compiler UMC 0.13 µm (L130E-HS-FSG), June 2003.
CHAPTER 13
Application Analysis and Resource Mapping for Heterogeneous Network Processor Architectures
Ramaswamy Ramaswamy, Ning Weng, Tilman Wolf
Department of Electrical and Computer Engineering, University of Massachusetts at Amherst
Computer networks have progressed from simple store-and-forward communication networks to more complex systems. Packets are not only forwarded, but also processed on routers in order to implement increasingly complex protocols and applications. Examples of such processing are network address translation (NAT) [1], firewalls [2], web switches [3], TCP/IP offloading for high-performance storage servers [4], and encryption for virtual private networks (VPN). To handle the increasing functional and performance requirements, router designs have moved away from hard-wired ASIC forwarding engines. Instead, software-programmable network processors (NPs) have been developed in recent years. These NPs are typically single-chip multiprocessors with high-performance I/O components. A network processor is usually located on each input port of a router. Packet-processing tasks are performed on the network processor before the packets are passed through the router switching fabric and on to the next network link. This is illustrated in Figure 13.1. Commercial examples of such systems are the Intel IXP2800 [5], IBM PowerNP [6], and EZchip NP-1 [7]. Due to the performance demands on NP systems, not only general-purpose RISC processor cores are used, but also a number of specialized coprocessors. It is quite common to find coprocessors for checksum computation, address lookup, hash computation, encryption and authentication, and memory management functions. This leads to network processor architectures with a number of different processing resources. The trend towards more heterogeneous NP architectures will continue with the advances in CMOS technology as an increasing number of processing resources can be put on an NP chip. This allows for NP
FIGURE 13.1: Packet data path on network router. Packets are shown as shaded boxes. Packet processing is performed on a network processor that is located at the input port of the system. The NP has a number of heterogeneous processing resources (processor cores and coprocessors).
architectures with more coprocessors; particularly those that implement more specialized, less frequently used functions. The heterogeneity of NP platforms poses a particularly difficult problem for application development. Current software development environments (SDKs) are already difficult to use and require an in-depth understanding of the hardware architecture of the NP system (something that traditionally has been abstracted by SDKs). Emerging NP systems with a large number of heterogeneous processing resources will make this problem increasingly difficult as the program developer will have to make choices on which hardware units to use for which tasks. Such decisions can have significant impact on the overall performance of the system as poor choices can cause contention on resources. One way to alleviate this problem is to profile and analyze the NP applications and make static or run-time decisions on how to assign processing tasks to a particular NP architecture. This process of identifying and mapping processing tasks to resources is the topic of this chapter.
Our approach to this problem is to analyze the run-time characteristics of NP applications and develop an abstract representation of the processing steps and their dependencies. This creates an annotated directed acyclic graph (ADAG), which is an architecture-independent representation of an application. The annotations indicate the processing requirements of each block and the strength of the dependency between blocks. The basic idea is that we build the application representation bottom-up. We consider each individual data and control dependency between instructions and group them into larger clusters, which make up the ADAG. The ADAG can then be used to determine an optimal allocation of processing blocks to any arbitrary NP architecture. Our contributions are as follows:
1. A methodology for automatically identifying processing blocks from a run-time analysis of NP applications.
2. An algorithm to group cohesive processing blocks into processing clusters and a heuristic to efficiently approximate this NP-complete problem. The result is an application graph (ADAG) that is an architecture-independent description of the processing requirements.
3. A mapping (or scheduling) algorithm to dynamically allocate processing clusters to processing resources on arbitrary network processor architectures.
We present the results for all these points using four realistic applications. One of the key points of this work is that the ADAG can be created completely automatically from a run-time instruction trace of the program on a uniprocessor system. The results from this work can help in a number of ways. The ability to identify cohesive processing blocks in a program is crucial to support high-level programming abstractions on heterogeneous NP platforms. Also, quantitative descriptions of the processing steps in terms of processing complexity and amount of communication between processing steps are the basis for any efficient scheduling. 
Further, the ADAG representation of an application gives very intuitive insights into the type of parallelism present in the application (e.g., multiprocessing vs. pipelining). Finally, the proposed scheduling algorithm can map packets at run-time to processing resources, which is superior to static approaches that depend on a priori knowledge of traffic patterns. In Section 13.1, we briefly discuss related work. Section 13.2 discusses the run-time analysis of NP applications and how we obtain the necessary profiling information. To get from the profiling information to an ADAG representation, we use a clustering algorithm that is discussed in Section 13.3. The results of the application analysis and the resulting ADAGs are presented for four applications in Section 13.4. The scheduling algorithm that maps the ADAGs to processing
resources is presented in Section 13.5. Finally, Section 13.6 summarizes and concludes this chapter.
13.1
RELATED WORK There has been some work in the area of application analysis and programming abstraction for network processors. Shah et al. have proposed NP-Click [8] as an extension to the Click modular router [9] that provides architecture and program abstractions for network processors. This and similar approaches require an a priori understanding of application details in order to derive application modules. The problem with this approach is two-fold:
1. It assumes that the application developer can identify all application properties, which requires a deep understanding and careful analysis of the application. This will become an increasing problem as programming environments for network processors move towards higher-level and more complex abstractions. In such systems the application developer has less understanding of how a piece of code can be structured to match the underlying hardware infrastructure.
2. No run-time information is involved in application analysis. This is a crucial piece of information to use in making scheduling decisions. Using static application information only biases the results towards particular programming abstractions.
Therefore, we feel that it is crucial that application characteristics can be derived automatically as we propose in our work. In NEPAL [10], Memik and Mangione-Smith propose a run-time system that controls the execution of applications on a network processor. Applications are separated into modules at the task level using the NEPAL API and modules are mapped to execution cores dynamically. There is an extra level of translation during which the application code is converted to modules. We avoid this translation and work directly with the dynamic instruction trace to generate basic blocks at a fine-grained level. The problem of partitioning and mapping applications has been studied previously in the context of grid computing. 
Taura and Chien [11] present an approach that is similar to ours, but is used at a very high level to map an application on several heterogeneous computing resources on the Internet. Our work is similar to some of the ideas used in trace scheduling [12] and superblocks [13], but is applied to a different domain. Trace scheduling aims to generate optimized code for VLIW architectures by exploiting more
instruction-level parallelism. Linear sequences of basic blocks are grouped to form larger basic blocks and dependencies are updated. We use a clustering algorithm for this purpose, which works on a dynamic instruction trace after all basic blocks have been identified. The exploration of application characteristics, modularization, and mapping to resources has also been studied extensively in the context of multiprocessor systems. An example of scheduling DAGs on multiprocessors is Ref. [14]. In the network processing environment, applications are much smaller and a more careful analysis can be performed. The bottom-up approach that we propose in Section 13.3 is not feasible for very large applications, but yields very detailed application properties for packet-processing functions. Another difference between multiprocessors and NPs is that NPs aim at achieving high throughput rather than short delay in execution. Mapping and scheduling of DAGs or task graphs has been surveyed by Kwok and Ahmad [15], and we propose a scheduling algorithm similar to that introduced by El-Rewini and Lewis [16]. The main difference in both cases is that we are considering a heterogeneous processing platform, where tasks can take different amounts of processing time (e.g., depending on the use of general-purpose processors vs. coprocessors). Similar algorithms have also been used in the VLSI CAD area. Clustering approaches similar to ours have been surveyed by Alpert and Kahng [17]. The methodology of clustering functionality proposed by Karypis et al. [18] is similar to ours, but applied to the VLSI domain.
13.2
APPLICATION ANALYSIS Our goal is to generate workload models that can be used for network processor scheduling independent of the underlying hardware architecture. There are a number of approaches that have been developed for multiprocessor systems, real-time systems, and compiler optimization that characterize applications, and our approach uses some of these well-known concepts. The main difference is that network processing applications are very simple and execute a relatively small number of instructions (as compared to workstation applications). This allows us to use much more detailed analysis methods that would be infeasible for large programs. Also, there are a few issues that are specific to the NP domain and usually are not considered for workstation applications. When exploiting parallelism in NP applications, we do not necessarily have to use multiple parallel processors for one packet, but we can also use pipelining. Additionally, the heterogeneity of processing resources requires that we can identify the portions of an application that can be executed on specialized coprocessors.
13.2.1
Static vs. Dynamic Analysis One key question is whether to use a static or a dynamic application analysis as the basis for this work. With a static analysis, detailed information about every potential processing path can be derived. All processing blocks can be analyzed, even the ones that are not used or hardly used during run-time. A static analysis typically results in a call graph, which shows the static control dependencies (i.e., which function can call which other function). This gives a good basic understanding of the application structure, but does not yield any information on its run-time behavior. But run-time behavior is exactly what is crucial for network processor performance. A run-time analysis of the application (e.g., an instruction trace) shows exactly which instructions were executed and which instruction blocks were not used at all. In addition, all actual load and store addresses are available, which can be used for an accurate data dependency analysis. The drawback of run-time analysis is that each packet could potentially cause a different sequence of execution. In a few cases, certain blocks are executed a different number of times depending on the size of the packet (we show this effect later in the context of packet encryption). Thus, the results are specific to a particular packet. To generalize these results to match arbitrary network traffic, a large number of packets can be analyzed and the union of all common execution paths can be used for the application analysis and scheduling. Less common execution paths can be handled as exceptions in the slow path of a network processor. We have chosen to follow the path of run-time analysis because it more accurately reflects the actual processing and provides actual load and store addresses, which are important when determining data dependencies. 
To address the issue of variations in network traffic processing even within the same application (e.g., different packet services or number of loop executions), there are several solutions. One is to assume the packet uses the most common execution path and if not, an exception is raised and processing is continued on the control processor. This is currently done on some network processors (e.g., if IP options are detected in an IP forwarding application). Another approach is to analyze a large number of packets and find the union of all execution paths. By scheduling the union on the network processor system, it is guaranteed that all packets can be processed, but the drawback is a lower system utilization as not all components will be used at all times. In this work, we focus on the analysis of a single packet for each application with the understanding that the work can be extended to consider a range of network traffic. Another issue that arises from a run-time analysis is that it necessarily is done on compiled code. This introduces a certain bias toward a particular
compiler and instruction set architecture. In our analysis, we used only compiler optimizations that are independent of the target system (e.g., no loop unrolling). Together with the use of a general RISC instruction set, the assumption is that the analysis yields results that are generally applicable to most processing engines in current network processors.
13.2.2
Annotated Directed Acyclic Graphs The result of the application analysis needs to yield an application representation that is independent of the underlying architecture and can later be used for mapping and scheduling. For this purpose, we use an annotated directed acyclic graph, which we call ADAG. The ADAG represents processing steps (or blocks) as vertices and dependencies as edges. The processing steps are dynamic processing steps; that is, they represent the instructions that are actually executed and loops cause the generation of processing blocks for each iteration. Only by considering the dynamic instances of each processing block can we obtain an acyclic graph. Also, it is desirable to consider only the instructions that are actually executed rather than any “dead code.” The dependencies that we consider are data dependencies as well as control dependencies. There are two key parameters in an ADAG:
1. Node weights. These indicate the amount of processing that is performed on each node (e.g., number of RISC instructions).
2. Edge weights. These indicate the amount of state that is transferred between processing blocks.
Such an ADAG fully describes the processing and communication relationship for all processing blocks of an application.
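As an illustration of the two annotations, the following is a minimal sketch of an ADAG container. The class name, method names, and the convention that blocks are numbered in execution order (so every edge runs from a lower to a higher block number) are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ADAG:
    """Hypothetical ADAG container: node weights plus weighted dependency edges."""

    # block id -> number of instructions executed in that block
    node_weight: Dict[int, int] = field(default_factory=dict)
    # (producer block, consumer block) -> number of values transferred
    edge_weight: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def add_block(self, block: int, instructions: int) -> None:
        self.node_weight[block] = instructions

    def add_dependency(self, producer: int, consumer: int, values: int) -> None:
        # Assumption: blocks are numbered in execution order, so a block can
        # only depend on an earlier block. This keeps the graph acyclic.
        if producer >= consumer:
            raise ValueError("a block can only depend on an earlier block")
        key = (producer, consumer)
        self.edge_weight[key] = self.edge_weight.get(key, 0) + values

    def total_processing(self) -> int:
        """Sum of all node weights."""
        return sum(self.node_weight.values())

    def values_into(self, block: int) -> int:
        """Total amount of state that `block` receives from other blocks."""
        return sum(v for (_, dst), v in self.edge_weight.items() if dst == block)


if __name__ == "__main__":
    g = ADAG()
    g.add_block(0, 12)
    g.add_block(1, 30)
    g.add_block(2, 8)
    g.add_dependency(0, 1, 2)
    g.add_dependency(0, 2, 1)
    g.add_dependency(1, 2, 4)
    print(g.total_processing(), g.values_into(2))
```

Storing edges in a dictionary rather than a dense matrix is only a convenience for sparse graphs; the same information can be held in the matrix form used in Section 13.3.1.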
13.2.3
Application Parallelism and Dependencies In order to make use of the parallelism in a network processor architecture, the ADAG should have as few dependencies between processing blocks as possible. This increases the potential for parallelizing and pipelining processing tasks. At the same time, dependencies that are inherent to the application must not be left out, to ensure a correct representation. We consider the following dependencies in our run-time analysis:
✦ Data dependencies. If an instruction reads a certain data location, then it becomes dependent on the most recent instruction that wrote to this location. Note that any type of memory, including registers, needs to be considered.
✦ Control dependencies. If a branch occurs due to a computation (i.e., a conditional branch), then the branch target is dependent on the instructions that compute the condition for the branch. Note that unconditional branches do not cause dependencies as they are static and any potential dependencies between two blocks would be covered by data dependencies.
Note that these dependencies do not include anything related to resources. Resource conflicts result from the underlying hardware and are not a property of the application; they are thus not considered. Other “hazards” [e.g., write-after-read (WAR)] do not need to be considered either, because the run-time trace is a correct execution of the application where all hazards have already been resolved. The result of the dependency analysis is an annotated run-time trace as shown in Figure 13.2. The trace sample is taken from an IPv4 forwarding implementation (details can be found in Section 13.4). As shown in Figure 13.2, data dependencies are tracked across registers as well as memory locations. Also, control dependencies between basic blocks are shown. Note that the resulting graph is directed and acyclic since dependencies can only “point downward” (i.e., no instruction can ever depend on a later instruction). Since the dependencies are limited to the absolutely necessary ones (i.e., data and control), we get a DAG that is as sparse as possible and thus exhibits the maximum amount of parallelism. By focusing on these dependencies only, it is possible to find parallelism in the application despite the serialization that was introduced by running it on a uniprocessor simulator.
FIGURE 13.2: Instruction trace analysis example. Data dependencies between writes and reads in registers and memory locations are shown as well as control dependencies for conditional branches.
Note that the analysis is done as a post-processing step of an instruction trace from simulation. This trace contains effective memory addresses and information about all status bits that are changed during execution. This means that there is no need for memory disambiguation.
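The data dependency rule above (a read depends on the most recent writer of the same location) can be sketched as a single pass over the trace. The trace record format used here, with each instruction listing the locations it reads and writes, is an assumption for illustration; the actual trace format of the simulator is not specified in the text. Control dependencies are left out of this sketch.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical trace record: locations are register names or memory addresses."""

    reads: Tuple[str, ...]
    writes: Tuple[str, ...]


def data_dependencies(trace: List[Instr]) -> Set[Tuple[int, int]]:
    """Return (producer index, consumer index) pairs for read-after-write dependencies."""
    last_writer: Dict[str, int] = {}
    deps: Set[Tuple[int, int]] = set()
    for i, instr in enumerate(trace):
        # Handle reads before writes so that an instruction which reads and
        # writes the same location depends on the previous writer, not itself.
        for loc in instr.reads:
            if loc in last_writer:
                deps.add((last_writer[loc], i))
        for loc in instr.writes:
            last_writer[loc] = i
    return deps


# Made-up four-instruction trace.
example_trace = [
    Instr(reads=(), writes=("r1",)),                  # 0: load immediate into r1
    Instr(reads=("r1",), writes=("mem:0x100",)),      # 1: store r1 to memory
    Instr(reads=(), writes=("r1",)),                  # 2: overwrite r1
    Instr(reads=("mem:0x100", "r1"), writes=("r2",)),  # 3: load and add
]

if __name__ == "__main__":
    print(sorted(data_dependencies(example_trace)))
```

Because only the most recent writer is recorded, instruction 3 depends on instruction 2 for r1 and not on instruction 0, and every dependency points from an earlier to a later instruction, so the resulting graph is acyclic.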
13.2.4
ADAG Reduction A practical concern of this methodology is that the number of processing blocks is large (in the order of the total instructions executed) and the representation of the DAG becomes unwieldy. Therefore, we take a simplifying step that significantly reduces the number of processing steps: instead of considering individual instructions, we consider basic blocks. A basic block is a group of instructions that are executed in sequence and has no internal control flow change. That is, the execution of a program can jump only to the beginning of a basic block and cannot jump somewhere else until the end of the basic block is reached. Still, all necessary dependencies are considered, but the smallest code fragment that can be parallelized or pipelined is a basic block. In Figure 13.2, basic blocks are separated by dashed lines. Even with a reduction to basic blocks, the resulting ADAG is not a suitable representation of an application, because it does not capture any higher-level application properties. The dependency between processing blocks can be very different depending on the nature of the application. Most applications show a “natural” separation between parts of the application (e.g., checksum verification and destination address lookups in IP forwarding), while showing a strong dependency within a particular part (e.g., all basic blocks of checksum computation). In order to consider such “clustering,” we further reduce the ADAG with the algorithm described in the following section.
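A minimal sketch of the reduction from instructions to dynamic basic blocks is given below. It assumes a trace of (program counter, is control transfer) pairs and treats any instruction that was executed directly after a control transfer as the start of a block. Both the record format and the function name are hypothetical.

```python
from typing import List, Tuple

# Hypothetical trace entry: (program counter, True if the instruction is a
# branch, jump, call, or return).
TraceEntry = Tuple[int, bool]


def split_basic_blocks(trace: List[TraceEntry]) -> List[List[int]]:
    """Split a dynamic trace into dynamic basic block instances (lists of PCs)."""
    # Pass 1: every address that was executed right after a control transfer
    # (taken target or fall-through) is the start of a basic block.
    leaders = set()
    for (_, is_transfer), (next_pc, _) in zip(trace, trace[1:]):
        if is_transfer:
            leaders.add(next_pc)

    # Pass 2: cut the trace before each leader and after each control transfer.
    blocks: List[List[int]] = []
    current: List[int] = []
    for pc, is_transfer in trace:
        if current and pc in leaders:
            blocks.append(current)
            current = []
        current.append(pc)
        if is_transfer:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Made-up trace of a loop body (addresses 4 and 8) that executes twice.
loop_trace = [(0, False), (4, False), (8, True), (4, False), (8, True), (12, False)]

if __name__ == "__main__":
    print(split_basic_blocks(loop_trace))
```

Each loop iteration yields its own block instance, which matches the use of dynamic processing steps in the ADAG. Branch targets that never occur in the analyzed trace are not detected as leaders, which is a limitation of this trace-only sketch.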
13.3
ADAG CLUSTERING USING MAXIMUM LOCAL RATIO CUT When assigning processing steps to a network processor architecture, there are several points that need to be considered. Most of all, there is a tradeoff between the cost of processing (or the speed-up that is gained by using a coprocessor) and the cost of communication. This implies that it is not desirable to offload small processing blocks to coprocessors, especially when this requires a large amount of communication.
Our ADAG generation results in a graph with thousands of basic blocks. The dependencies between them can cause a large amount of communication if the basic blocks were to be distributed to different computational resources. Thus, using our ADAG directly for workload mapping is not suitable. Instead, we want to reduce the number of processing components in the ADAG to yield a more natural, tractable grouping of processing instructions. For this purpose, we use a clustering technique called ratio cut [19]. Ratio cut has the nice property of identifying “natural” clusters within a graph without the need for a priori knowledge of the final number of clusters. The ratio cut algorithm is unfortunately NP-complete and thus not tractable for ADAGs with the number of nodes that we need to consider here. Therefore, we propose a heuristic that is based on ratio cut and reduces the computational complexity while still achieving good results. Our heuristic is called maximum local ratio cut (MLRC).
13.3.1
Clustering Problem Statement
Before discussing the algorithm, let us formalize the problem that we address here. The ADAG, $A_n = (P_n, D_n)$, consists of a processing vector, $P_n$, and a dependency matrix, $D_n$. The processing vector contains the number of instructions that are executed in each of the $n$ processing blocks. The dependency matrix contains the data values that need to be transferred between each pair of blocks. If $d_{ij}$ is nonzero, then block $i$ depends on block $j$ because it reads $d_{ij}$ data values that are written by block $j$. (Control dependencies are considered to be one-value dependencies.) The $n$ blocks in $A_n$ are ordered in such a way that the upper right of the dependency matrix is zero, which ensures that $A_n$ is a directed acyclic graph. Thus, we have:
$$A_n = (P_n, D_n) \quad \text{with} \quad P_n = (p_1, \ldots, p_n) \quad \text{and} \quad D_n = \begin{pmatrix} 0 & \cdots & \cdots & 0 \\ d_{21} & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ d_{n1} & \cdots & d_{n\,n-1} & 0 \end{pmatrix} \tag{13.1}$$
The goal of the clustering process is to generate a new ADAG, $A_{n'}$, that is based on $A_n$ but smaller ($n' < n$). Sets of nodes from $A_n$ can be combined into clusters, which then become nodes in $A_{n'}$. If $m$ nodes $i_1 \ldots i_m$ are combined into a cluster node $j$, then $p_j = \sum_{k=1}^{m} p_{i_k}$. The nodes $x$ on which $j$ depends are updated such that $d_{jx} = \sum_{k \in \{i_1 \ldots i_m\}} d_{kx}$. The dependents $y$ of $j$ are updated accordingly to $d_{yj} = \sum_{k \in \{i_1 \ldots i_m\}} d_{yk}$. Basically, if nodes are clustered, then the new cluster combines the properties of all its nodes: the processing costs are added together and the dependencies are combined. A clustering step can be performed only if the resulting graph is still a directed acyclic graph (i.e., the dependency matrix can be reordered to have the upper right be zero). This clustering can be performed repeatedly in order to reduce $n$ to the desired number of total clusters. Next, we discuss ratio cut, which is an algorithm to determine which nodes should be clustered and how many clusters the final solution contains.
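The merge rule can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `merge_cluster` and `is_acyclic` are hypothetical, the ADAG is held as a plain list `p` and a nested list `d` (with `d[i][j]` the number of values block `i` reads from block `j`), and the acyclicity test uses Kahn's algorithm, which the chapter does not prescribe.

```python
from typing import List, Sequence, Tuple


def merge_cluster(p: Sequence[int], d: Sequence[Sequence[int]],
                  members: Sequence[int]) -> Tuple[List[int], List[List[int]]]:
    """Merge the nodes in `members` into one cluster node.

    The cluster is appended as the last node of the result. The caller still
    has to reorder nodes topologically to restore the lower-triangular form.
    """
    group = set(members)
    keep = [i for i in range(len(p)) if i not in group]
    # Processing costs of the merged nodes are added together.
    p_new = [p[i] for i in keep] + [sum(p[i] for i in group)]
    size = len(keep) + 1
    d_new = [[0] * size for _ in range(size)]
    for a, i in enumerate(keep):
        for b, j in enumerate(keep):
            d_new[a][b] = d[i][j]
        # Dependents y of the cluster: d_yj = sum over members k of d_yk.
        d_new[a][size - 1] = sum(d[i][k] for k in group)
        # Nodes x the cluster depends on: d_jx = sum over members k of d_kx.
        d_new[size - 1][a] = sum(d[k][i] for k in group)
    return p_new, d_new


def is_acyclic(d: Sequence[Sequence[int]]) -> bool:
    """Kahn's algorithm: True if the dependency matrix describes a DAG."""
    n = len(d)
    pending = [sum(1 for j in range(n) if d[i][j] > 0) for i in range(n)]
    ready = [i for i in range(n) if pending[i] == 0]
    done = 0
    while ready:
        j = ready.pop()
        done += 1
        for i in range(n):
            if d[i][j] > 0:          # i was waiting for j
                pending[i] -= 1
                if pending[i] == 0:
                    ready.append(i)
    return done == n


# Diamond-shaped example: blocks 1 and 2 read from 0, block 3 reads from 1 and 2.
p = [10, 5, 7, 3]
d = [[0, 0, 0, 0],
     [2, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 4, 1, 0]]
p_ok, d_ok = merge_cluster(p, d, [1, 2])    # valid: result is still a DAG
p_bad, d_bad = merge_cluster(p, d, [0, 3])  # invalid: cluster and blocks 1, 2 depend on each other
```

In this example, merging blocks 1 and 2 yields a three-node chain, while merging blocks 0 and 3 produces a cycle and would therefore be rejected as a clustering step.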
13.3.2
Ratio Cut
The basic concept of ratio cut is to cluster together nodes that show some natural cohesiveness, as described in detail in Ref. [19]. In our context, ratio cut clusters instruction blocks together such that:
✦ clusters perform a significant amount of processing and
✦ clusters have few dependencies between them.
The metric to determine these properties is the ratio cut, $r_{ij}$, for two clusters $i$ and $j$, which is defined as (see footnote 1):
$$r_{ij} = \frac{d_{ij} + d_{ji}}{p_i \times p_j}. \tag{13.2}$$
The ratio cut algorithm clusters the graph such that $r_{ij}$ is minimized. This means the dependencies between $i$ and $j$ are small and the amount of processing that is done by $i$ and $j$ is large. Note that either $d_{ij}$ or $d_{ji}$ has to be zero due to the acyclic property of an ADAG.
Ratio cut operates in a top-down fashion. Starting from one cluster that contains all nodes (i.e., $A_1$), the ratio cut is applied to find two groups that minimize $r_{ij}$. Then this process is applied recursively within each group. With each recursion step, the minimum ratio cut value will increase (because the clusters will have fewer and fewer clear separations). The clustering process can be terminated when the ratio cut exceeds a certain threshold ($r_{ij} > t_{terminate}$). The value of this threshold determines how tightly clustered the result is. If $t_{terminate}$ is small, then only a few clusters will be found, but the separations between them will be very clear (i.e., little dependency). If $t_{terminate}$ is large, then many clusters will be found (all $n$ blocks in the limit) and the dependencies between them can be significant (i.e., requiring large data transfers). In either case, the exact number of clusters is not predetermined. This is why ratio cut is considered an algorithm that finds a natural clustering that depends on the properties of the graph (i.e., $r_{ij}$).
While ratio cut is an ideal algorithm for our purposes, it has one major flaw: it is NP-complete (for proof see Ref. [19]). Basically, it is necessary to consider an exponential number of potential clusterings in each step. This makes a practical implementation infeasible. The heuristics that have been proposed in Ref. [19] are also not suitable, as they assume and require the graph to be undirected, which is not the case for our ADAG. To address this problem, we propose a heuristic that uses the ratio cut metric but is less computationally complex.
1. The ratio cut described in Ref. [19] uses nodes of uniform size and thus $r_{ij} = \frac{d_{ij} + d_{ji}}{|i| \times |j|}$. We are not interested in the number of blocks that are in each cluster, but in the amount of processing that is performed. Thus, we adapted the definition of $r_{ij}$ accordingly.
13.3.3
Maximum Local Ratio Cut
Instead of using the top-down approach that requires the exploration of a number of possible clusterings that grows exponentially with $n$, we propose to use a bottom-up approach in our heuristic, which we call maximum local ratio cut (MLRC). It is called “local” ratio cut because MLRC makes a local decision when merging nodes. MLRC operates as follows:
1. Start with ADAG $A_i = A_n$, which has all nodes separated.
2. For each pair $(i, j)$, compute the local ratio cut $r_{ij}$.
3. Find the pair $(i_{max}, j_{max})$ that has the maximum local ratio cut.
4. If the maximum ratio cut drops below the threshold ($r_{i_{max} j_{max}} < t_{terminate}$), stop the algorithm. $A_i$ is the final result.
5. Merge $i_{max}$ and $j_{max}$ into a cluster, resulting in $A_i^{new} = A_{i-1}$.
6. Set $A_i = A_i^{new}$ and repeat steps 2 through 6.
The intuition behind MLRC is to find the pair of nodes that should be least separated (i.e., one that has a lot of dependency and does little processing). This pair is then merged and the process applied recursively. As a result, clusters will form that show a lot of internal dependencies and little dependencies with other clusters. Of course, this is a heuristic and therefore cannot find the best solution for all possible ADAGs. The following intuition argues why MLRC performs well: ✦
If two nodes show a large ratio between them, it is likely that they belong to the same cluster in the optimal ratio cut solution.
✦
By merging two nodes that exhibit a high local ratio cut, the overall ratio cut of A is reduced (in most cases), which leads to a better solution overall.
✦
The termination criterion is similar to that of ratio cut and leads to a similarly natural clustering.
We show results for four applications in Section 13.4 that demonstrate the performance of the algorithm for realistic inputs.
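The MLRC steps can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the names `mlrc` and `creates_cycle` are hypothetical, clusters are kept in dictionaries rather than a reordered matrix, and pairs whose merge would close a cycle are simply skipped. The last point is an assumption made here to honor the requirement that a clustering step must leave a DAG; the text does not say how MLRC enforces it.

```python
from itertools import combinations


def creates_cycle(deps, i, j):
    """True if a dependency path between i and j passes through another
    cluster, so that merging i and j would close a cycle."""
    for src, dst in ((i, j), (j, i)):
        # Follow src's prerequisites, ignoring the direct edge src -> dst.
        stack = [k for k in deps[src] if k != dst]
        seen = set()
        while stack:
            k = stack.pop()
            if k == dst:
                return True
            if k not in seen:
                seen.add(k)
                stack.extend(deps[k])
    return False


def mlrc(p, d, t_terminate):
    """Bottom-up clustering. p[i]: instructions in block i; d[i][j]: values
    block i reads from block j. Returns (cost, deps, members) per cluster."""
    n = len(p)
    cost = {i: p[i] for i in range(n)}
    deps = {i: {j: d[i][j] for j in range(n) if d[i][j] > 0} for i in range(n)}
    members = {i: [i] for i in range(n)}
    while len(cost) > 1:
        best, pair = -1.0, None
        for i, j in combinations(cost, 2):
            # Local ratio cut r_ij = (d_ij + d_ji) / (p_i * p_j).
            r = (deps[i].get(j, 0) + deps[j].get(i, 0)) / (cost[i] * cost[j])
            if r > best and not creates_cycle(deps, i, j):
                best, pair = r, (i, j)
        if pair is None or best < t_terminate:
            break                       # step 4: termination threshold reached
        i, j = pair                     # step 5: merge j into i
        cost[i] += cost.pop(j)
        members[i] += members.pop(j)
        for k, v in deps.pop(j).items():
            if k != i:
                deps[i][k] = deps[i].get(k, 0) + v
        deps[i].pop(j, None)            # dependencies inside the cluster vanish
        for other in deps.values():
            if j in other:
                other[i] = other.get(i, 0) + other.pop(j)
    return cost, deps, members


# Diamond-shaped example: blocks 1 and 2 read from 0, block 3 reads from 1 and 2.
p = [10, 5, 7, 3]
d = [[0, 0, 0, 0],
     [2, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 4, 1, 0]]
cost, deps, members = mlrc(p, d, t_terminate=0.1)
```

With a threshold of 0.1, only blocks 1 and 3 are merged (their local ratio cut is 4/15), because every remaining pair falls below the threshold. The cycle check adds a graph search per candidate pair, so this sketch does not match the per-pair O(1) cost assumed in the complexity analysis.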
13.3.4
MLRC Complexity
The maximum local ratio cut algorithm has a complexity that is tractable and feasible to implement. The algorithm runs for at most $n$ iterations (in case the termination threshold $t_{terminate}$ is not reached until the last step). In the iteration with $i$ clusters, the ratio cut for $i^2/2$ pairs needs to be computed [each of which takes $O(1)$]. Finding the maximum can easily be done during the computation. Thus, the total computational complexity is:
$$\sum_{i=1}^{n} \frac{i^2}{2} \times O(1) = O(n^3). \tag{13.3}$$
The space requirement for MLRC is $O(n^2)$, which is the same complexity that is required to represent $A_n$. Thus, MLRC is a feasible alternative to the NP-complete ratio cut algorithm. In the following section, we show the performance of MLRC on a set of network processing applications.
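As a worked check of Equation 13.3, the sum can be evaluated in closed form with the standard identity for the sum of squares. The numeric line uses n = 2340, the largest basic block count reported in Table 13.1, purely to illustrate the size of the upper bound; it is not a measured operation count.

```latex
\[
\sum_{i=1}^{n} \frac{i^2}{2}
  = \frac{1}{2} \cdot \frac{n(n+1)(2n+1)}{6}
  = \frac{n(n+1)(2n+1)}{12}
  = \frac{n^3}{6} + O(n^2)
  = O(n^3)
\]
% Illustration for n = 2340:
\[
\frac{2340 \cdot 2341 \cdot 4681}{12} = 2{,}136{,}853{,}095 \approx 2.1 \times 10^{9}
\]
```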
13.4
ADAG RESULTS To illustrate the behavior and results of the application analysis, we use a set of four network processing applications. We briefly discuss the tool that we use to derive run-time traces and the details of the applications. Then we show the clustering process for one application and the final results for all four applications.
13.4.1
The PacketBench Tool
In order to obtain runtime analysis of application processing, we use a tool called PacketBench that we have developed [20]. The goal of PacketBench is to emulate the functionality of a network processor and provide an easy-to-use environment for implementing packet processing functionality. The conceptual outline of the tool is shown in Figure 13.3.
[Figure 13.3: PacketBench architecture. The application implements the packet-processing functionality that is measured. PacketBench provides support functions for packet and memory management. The simulator generates an instruction trace for the application (and not the framework) through selective accounting.]
The main components are: ✦
PacketBench framework. The framework provides functions that are necessary to read and write packets, and manage memory. This involves reading and writing trace files and placing packets into the memory data structures used internally by PacketBench. On a network processor, many of these functions are implemented by specialized hardware components and therefore should not be considered part of the application.
✦
PacketBench API. PacketBench provides an interface for applications to receive, send, or drop packets as well as doing other high-level operations. Using this clearly defined interface makes it possible to distinguish between PacketBench and application operations during simulation.
✦
Network processing application. The application implements the actual processing of the packets. This is the processing that we are interested in, as it is the main contributor to the processing delay on a router (e.g., packet classification for firewalling or encryption for VPN tunneling). The workload characteristics of the application need to be collected separately from the workload generated by the PacketBench framework.
✦
Processor simulator. To get instruction-level workload statistics, we use a full processor simulator. In our current prototype we use SimpleScalar [21], but in principle any processor simulator could be used. Since we want to limit the workload statistics to the application and not the framework, we modified the simulator to distinguish operations accordingly. The Selective Accounting component does that and thereby generates workload statistics as if the application had run by itself on the processor. This corresponds to the actual operation of a network processor, where the application runs by itself on one of the processor cores. Additionally, it is possible to distinguish between accesses to various types of memory (instruction, packet data, and application state), which is useful for a detailed processing analysis.
The key point about this system design is that the application and the framework can be clearly distinguished—even though both components are compiled into a single executable in order to be simulated. This is done by analyzing the instruction addresses and sequence of API calls. This separation allows us to adjust the simulator to generate statistics for the application processing and ignore the framework functions. This is particularly important as network processing consists of simple tasks that execute only a few hundred instructions per packet [22]. Also, in real network systems the packet management functions are implemented in dedicated hardware and not by the network processor and thus should not be considered part of the workload. Another key benefit of PacketBench is the ease of implementing new applications. The architecture is modular and the interface between the application and the framework is well defined. New applications can be developed in C, plugged into the framework, and run on the simulator to obtain processing characteristics. In our prototype, the PacketBench executable is simulated on a typical processor simulator to get statistics of the number of instructions executed and the number of memory accesses made. We use the ARM [23] target of the SimpleScalar [21] simulator, to analyze our applications. This simulator was chosen because the ARM architecture is very similar to the architecture of the core processor and the microengines found in the Intel IXP1200 network processor [24], which is used commonly in academia and industry. The tools were set up to work
on an Intel x86 workstation running RedHat Linux 7.3. PacketBench supports packet traces in the tcpdump [25] format and the Time Sequenced Header (TSH) format from NLANR [26]. The latter trace format does not contain packet payloads, so we have the option of generating dummy payloads of the size specified in the packet header. For the experiments that we perform in this work, the actual content of the payload is not relevant as no data-dependent computations are performed. The run-time traces that we obtain from PacketBench contain the instructions that are executed, the registers and memory locations that are accessed, and an indication of any potential control transfer. Using these traces we build an ADAG that considers dependencies among instructions as well as allows us to discover any potential parallelism. Since we make no assumption on the processing order other than the dependencies between data (see next subsection), we are able to represent the application almost independently from a particular system.
13.4.2
Applications The four network processing applications that we evaluate range from simple forwarding to complex packet payload modifications. The first two applications are IP forwarding according to current Internet standards using two different implementations for the routing table lookup. The third application implements packet classification, which is commonly used in firewalls and monitoring systems. The fourth application implements encryption, which is a function that actually modifies the entire packet payload and is used in VPNs. The specific applications are as follows: ✦
IPv4-radix. IPv4-radix is an application that performs RFC1812-compliant packet forwarding [27] and uses a radix tree structure to store entries of the routing table. The routing table is accessed to find the interface to which the packet must be sent, depending on its destination IP address. The radix tree data structure is based on an implementation in the BSD operating system [28].
✦
IPv4-trie. IPv4-trie is similar to IPv4-radix and also performs RFC1812-based packet forwarding. This implementation uses a trie structure with combined level and path compression for the routing table lookup. The depth of the structure increases very slowly with the number of entries in the routing table. More details can be found in Ref. [29].
✦
Flow Classification. Flow Classification is a common part of various applications such as firewalling, NAT, and network monitoring. The packets passing
through the network processor are classified into flows which are defined by a five-tuple consisting of the IP source and destination addresses, source and destination port numbers, and transport protocol identifier. The five-tuple is used to compute a hash index into a hash data structure that uses linked lists to resolve collisions. ✦
IPSec encryption. IPSec is an implementation of the IP Security Protocol [30], where the packet payload is encrypted using the Rijndael algorithm [31], which is the new Advanced Encryption Standard (AES) [32]. This algorithm is used in many commercial VPN routers. This is the only application where the packet payload is read and modified. It should be noted that the encryption processing for AES shows almost identical characteristics as the decryption processing. We do not further distinguish between the two steps.
The selected applications cover a broad space of typical network processing. IPv4-radix and IPv4-trie are realistic, full-fledged packet forwarding applications, which perform all required IP forwarding steps (header checksum verification, decrementing TTL, etc.). IPv4-radix represents a straightforward unoptimized implementation, while IPv4-trie performs a more efficient IP lookup. The applications can also be distinguished between header processing applications (HPA) and payload processing applications (PPA) (as defined in Ref. [22]). HPA process a limited amount of data in the packet headers and their processing requirements are independent of packet size. PPA perform computations over the payload portion of the packet and are therefore more demanding in terms of computational power as well as memory bandwidth. IPSec is a payload processing application and the others are header processing applications. The applications also vary significantly in the amount of data memory that is required. Encryption needs to store only a key and small amounts of state, but the routing tables of the IP forwarding applications are very large. Altogether, the four applications chosen in this work are good representatives of different types of network processing. They display a variety of processing characteristics as is shown later. To characterize workloads accurately, it is important to have realistic packet traces that are representative of the traffic that would occur in a real network. We use several traces from the NLANR repository [26] and our local intranet. The routing table for the IP lookup applications is MAE-WEST [33].
13.4.3
Basic Block Results
The initial analysis of basic blocks and their dependencies yields the results shown in Table 13.1. IPv4-radix executes the largest number of instructions and has by far the most basic blocks. Note that the number of unique basic blocks is much smaller. This is due to the fact that many basic blocks are executed repeatedly during run-time. For Flow Classification almost all basic blocks are different, indicating that there are no loops.

Table 13.1. Results from application analysis.
Application    Number of Basic Blocks (n)    Number of Unique Basic Blocks    Maximum Processing (max(p_i))    Maximum Dependency (max(d_ij))
IPv4-radix     2340                          375                              29                               40
IPv4-trie      37                            28                               13                               11
Flow Class.    36                            35                               35                               29
IPSec          267                           93                               89                               82
13.4.4
Clustering Results Using the MLRC algorithm, the basic block ADAG is step-by-step decreased in size. Figure 13.4 shows the last 10 (A10 . . . A1 ) steps of this process for the Flow Classification application. In each cluster, the name of the cluster (e.g., c0) and the processing cost (e.g., 25) are shown. The edges show the dependency between clusters (number of data transfers). Note that the cluster names change across figures due to the necessary renaming to maintain DAG properties (zeros in upper right of dependency matrix). The start nodes (i.e., nodes that are not dependent on any other nodes) are shown as squares. The end nodes (i.e., nodes that have no dependents) are shown with a thick border. The following can be observed: ✦
Aggregation of nodes causes the resulting cluster to have a processing cost equal to the sum of the nodes.
✦
Edges are merged during the aggregation.
✦
The number of parallel nodes decreases as the number of clusters decreases.
The first two observations follow the expected behavior of MLRC. The third observation is more interesting. The reduction in parallelism means that an application that has been clustered “too much” cannot be processed efficiently and in parallel on a network processor system. Therefore it is crucial to determine when to stop the clustering process.
[Figure 13.4: Sequence of ADAG clustering, panels (a) through (j). ADAG for Flow Classification is shown for ten to one clusters. Each node shows the processing cost of the cluster and its name [e.g., 25 instructions for cluster c0 in (a)]. Note that the clusters are renamed with each merging step.]
In Figure 13.5, the progress of two metrics in the MLRC is shown for all four applications. The plots show the value of the maximum local ratio cut (local ratio cut) and the number of parallel nodes. The local ratio cut value decreases with fewer clusters, as is expected. In a few cases, the local ratio cut value increases after a merging step. This is due to MLRC being a heuristic and not an optimal algorithm. The initial local ratio cut value is one. For our applications, this is the worst case (e.g., occurring when there are two one-instruction blocks with one dependency) since there cannot be more dependencies than instructions.
[Figure 13.5: Local ratio cut algorithm behavior. One panel each for IPv4-radix, IPv4-trie, Flow Classification, and IPSec. x-axis: number of clusters; left y-axis: maximum local ratio cut value; right y-axis: number of parallel clusters.]
The number of parallel nodes is derived by counting the number of nodes that have at least one other node in parallel (i.e., there is no direct or transitive dependency). These nodes could potentially be processed in parallel on an NP system. Eventually this value drops to zero. For IPv4-trie and Flow Classification, this happens at around 5 clusters, for IPv4-radix at 20 clusters, and for IPSec at around 50 clusters. This indicates that IPv4-radix and IPSec are applications that lend themselves more towards pipelining than towards parallel processing.
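The parallel-node count described above can be sketched in Python. This is one possible reading of the definition, not the authors' tool: the function name `count_parallel_nodes` is hypothetical, and the code assumes the dependency matrix is already in the lower-triangular order of Section 13.3.1, so that every node's prerequisites have smaller indices.

```python
def count_parallel_nodes(d):
    """Count nodes that have at least one other node with which they share no
    direct or transitive dependency in either direction."""
    n = len(d)
    # reach[i]: every node that i depends on, directly or transitively.
    reach = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):              # lower-triangular: d[i][j] > 0 implies j < i
            if d[i][j] > 0:
                reach[i] |= {j} | reach[j]
    count = 0
    for i in range(n):
        if any(j != i and j not in reach[i] and i not in reach[j]
               for j in range(n)):
            count += 1
    return count


# Diamond: blocks 1 and 2 both read from 0 and feed 3, so only 1 and 2 are parallel.
diamond = [[0, 0, 0, 0],
           [2, 0, 0, 0],
           [1, 0, 0, 0],
           [0, 4, 1, 0]]
# Chain: 0 -> 1 -> 2, a pure pipeline with no parallel nodes.
chain = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]
print(count_parallel_nodes(diamond), count_parallel_nodes(chain))
```

A value of zero corresponds to the fully serial case discussed in the text, where the clusters can only be pipelined.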
13.4.5
Application ADAGs Figure 13.6 shows the ADAGs A20 for all four applications (independent of tterminate ). We can observe the following application characteristics:
✦ IPv4-radix is dominated by the lookup of the destination address using the radix tree data structure. This traversal of the radix tree causes the same loop to execute several times. Since we consider run-time behavior, each loop instance is considered individually. The patterns of processing blocks with 330, 181, 195, and 136 instructions in $A_{20}$ show these instruction blocks. Another observation is that the lack of parallelism between blocks is indicative of the serial nature of an IP lookup. Even though the same code is executed, there are data dependencies in the prefix lookup, which are reflected in the one-data-value dependencies shown in Figure 13.6.
✦ IPv4-trie implements a simpler IP lookup than IPv4-radix. The lookup is represented by the sequence of clusters three to nine with mostly five-data-value dependencies. IPv4-trie exhibits more parallelism, but still is dominated by the serial lookup.
✦ Flow Classification has two start nodes and a number of end nodes, where processing does not have any further dependents. These are write updates to the Flow Classification data structure. Altogether, there is a good amount of parallelism and less serial behavior than in the other applications.
✦ IPSec is extremely serial and the encryption processing repeatedly executes the same processing instructions, which are represented by the blocks with 69 instructions and 49 or 46 data dependencies going into the block. This particular example executes the encryption of two 32-byte blocks. The transition from the first to the second block happens in cluster 4. This application shows no parallelism, as is expected for encryption.
[Figure 13.6: ADAGs for workload applications. Panels: (a) IPv4-radix, (b) IPv4-trie, (c) Flow Classification, (d) IPSec.]
13.4.6
Identification of Coprocessor Functions
The final question for application analysis is how to identify processing blocks that lend themselves to coprocessor implementations. There are some functions that by default are ideal for coprocessing and that can be identified by the programmer (e.g., checksum computation due to its simplicity and streaming data access nature). We want to take a different look at the problem and attempt to identify such functions without a priori understanding of the application. The ADAGs show only how many instructions are executed by a processing block, but not which instructions. In order to identify whether there are instruction blocks that are heavily used in an application, we use the plots shown in Figure 13.7. The x-axis shows the instructions that are executed during packet processing. The y-axis shows each unique instruction address observed in the trace. For example, in IPSec, the 400th unique instruction is executed sixteen times (eight times between instruction 500 and 1000 and eight times between 1500 and 2000). Figure 13.7 is a good indicator for repetitive instruction execution. For Flow Classification, there are almost no repeated instructions. In IPSec, however, there are several instruction blocks that are executed multiple times (sixteen times for instructions with unique address 350 to 450). If these instructions can be implemented in dedicated hardware, a significant speed-up can be achieved due to the high utilization of this function.
[Figure 13.7: Detailed instruction access patterns of a single packet for Flow Classification and IPSec from the NLANR MRA Trace [26]. x-axis: instruction; y-axis: unique instruction address.]
Again, this method of coprocessing identification requires no knowledge or deep understanding of the application. Instead, the presented methodology extracts all this information from a simple instruction run-time trace. One problem with this methodology is that processing blocks that execute nonrepetitive functions are not identified as suitable for coprocessors, even though they could be (as is the case for Flow Classification). Such functions still need to be identified manually by the programmer.
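The repetition analysis can be approximated with a few lines of Python. This sketch makes several assumptions that are not from the source: the trace is taken to be a plain sequence of instruction addresses, unique addresses are numbered in order of first appearance, and the names `access_pattern` and `hot_regions` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def access_pattern(trace):
    """Map each executed address to a unique index in order of first
    appearance (the y-value plotted against the position in the trace)."""
    index = {}
    return [index.setdefault(addr, len(index)) for addr in trace]


def hot_regions(trace, min_count):
    """Return (first_index, last_index, count) for runs of consecutive unique
    indices that were all executed the same number of times, count >= min_count."""
    counts = Counter(access_pattern(trace))
    hot = [(y, counts[y]) for y in sorted(counts) if counts[y] >= min_count]
    regions = []
    # Consecutive indices share the same (index - position) offset in `hot`.
    for _, run in groupby(enumerate(hot), key=lambda t: (t[1][0] - t[0], t[1][1])):
        items = [item for _, item in run]
        regions.append((items[0][0], items[-1][0], items[0][1]))
    return regions


# Toy trace: a two-instruction loop body (addresses 108 and 112) runs three times.
trace = [100, 104, 108, 112, 108, 112, 108, 112, 116]
print(access_pattern(trace))   # unique index per executed instruction
print(hot_regions(trace, 2))   # candidate regions for a coprocessor
```

Regions with a high execution count are the candidates the text describes for dedicated hardware. Nonrepetitive functions do not show up here, which is the limitation noted above.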
13.5
MAPPING APPLICATION DAGs TO NP ARCHITECTURES Once application ADAGs have been derived, they can be used in multiple ways. One way of employing the information from ADAGs is for network processor design. With a clear description of the workload and its parallelism and pipelining characteristics, a matching system architecture can be derived. Another example is the use of application ADAGs to map instruction blocks to NP processing resources. In this section, we discuss this mapping and scheduling in more detail.
13.5.1
Problem Statement
The mapping and scheduling problem is the following: Given a packet-processing application and a heterogeneous network processor architecture, which processing task should be assigned to which processing resource (mapping) and at what time should the processing be performed (scheduling)? For this problem, we assume that a network processor has $m$ different processing resources $r_1 \ldots r_m$. These processors can be all of the same kind (e.g., all general-purpose processors) or can be a mix of general-purpose processors and coprocessors. All processing resources are connected with each other over an interconnect. Transferring data via the interconnect incurs a delay proportional to the amount of data transferred. Thus, if the application uses multiple resources in parallel, the communication cost for the data transfer needs to be considered.
An application is represented by an ADAG with $n$ clusters $c_1 \ldots c_n$ and their processing costs and dependencies. Since we have processing resources with different performance characteristics, the processing cost for a cluster is represented by a vector $p_i = (p_{i1}, \ldots, p_{im})$. This vector contains the processing cost for the cluster for each possible processing resource. If a cluster cannot be executed on a particular processing resource (e.g., checksum computation cannot be performed on a table-lookup coprocessor), the processing cost is $\infty$. The mapping solution, $M$, consists of $n$ pairs that indicate the assignment of all clusters $c_1 \ldots c_n$ to a resource $r_i$: $M = ((c_1, r_{i_1}) \ldots (c_n, r_{i_n}))$. The schedule, $S$, is similar, except that it also contains a time $t$ that indicates the start time of the execution of a cluster on a resource: $S = ((c_1, r_{i_1}, t_1) \ldots (c_n, r_{i_n}, t_n))$.
Finally, a performance criterion needs to be defined that is used to find the best solution. This could be shortest delay (i.e., earliest finish time of the last cluster) or best resource usage (i.e., highest utilization of used resources). We use minimum delay in our example. Unfortunately, this problem, too, is NP-complete. Malloy et al. established that producing a schedule for a system that includes both execution and communication cost is NP-complete, even if there are only two processing elements [14]. Therefore, we need to develop a heuristic to find an approximate solution. Mapping of task graphs to multiprocessors has been researched extensively and is surveyed by Kwok and Ahmad [15]. However, most of the previous work is targeted at homogeneous multiprocessor systems. Here, we consider the mapping of ADAGs onto a set of heterogeneous processing resources.
13.5.2
Mapping Algorithm
In our example, we consider the mapping of a single ADAG onto the processing resources. The goal is to map it in such a way as to minimize the overall finish time of the last cluster. This mapping also yields maximum use of the application’s parallelism. We consider only one packet in this example, but the approach can easily be extended to consider the scheduling of multiple packets.
There are two parts to our mapping and scheduling algorithm. First, we identify the nodes that are most critical to the timely execution of the packet (i.e., the nodes that lie on the critical path). For this purpose we introduce a metric called the criticality, $c_i$, of a node $i$. The criticality is determined by finding the critical path (bottom-up) in the ADAG, using the processing time of each cluster on a general-purpose processor (we assume that this is resource 1). For each end node $e$ (no children), the criticality is just its default processing time: $c_e = p_{e1}$. For all other nodes $i$, the criticality is the maximum criticality of its children plus its own processing time: $c_i = \max_{j : d_{ji} > 0} c_j + p_{i1}$.
The clusters are then scheduled in order of their criticality such that each assignment achieves the minimum increase in the overall finish time. This requires that the current finish time of each node and resource be maintained. When determining the finish time of a cluster, $f_i$, the finish time of all its parents (on which it depends) needs to be considered as well as the delay due to data transfers between different processing resources over the interconnect. Thus, the mapping and scheduling algorithm to heuristically find the earliest finish time of a processing application is:
1. Calculate each node’s criticality $c_i$ as defined previously.
2. Sort the nodes into a list $L$ by decreasing criticality.
3. Dequeue node $N$ with highest criticality from $L$.
4. For each resource $r_i$, determine the finish time for $N$ by adding the maximum finish time of all parents (plus interconnect overhead) to the processing time, $p_{Ni}$, of $N$ on resource $r_i$. Assign $N$ to the resource that minimizes the finish time.
5. Repeat steps 3 through 5 until $L$ is empty.
The algorithm is developed based on the following observation: If one maps the critical path of the ADAG with minimal delay and all noncritical path nodes meet their deadlines, then the resulting schedule is optimal. So, the critical path gives us a global view of the ADAG, but mapping is done by local decisions to avoid exponential complexity. This algorithm uses a greedy approach: Given node $N$, it tries to identify the processing element that yields the earliest finishing time by either (1) reducing communication cost and using the same resource as its parents or (2) using a faster coprocessor and paying for communication delay.
Our mapping and scheduling algorithm is based on the list scheduling techniques that are well explored under different assumptions and terminology. The criticality metric is similar to assigning a priority to the task as proposed by El-Rewini and Lewis [16], where it is called static bottom level. However, we use a similar algorithm and metric in the context of a heterogeneous system. Our metric, the early finishing time instead of early starting time, helps us explore the option of the fast processor when assigning one task to a potential processing element.
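The criticality metric and the greedy list scheduling steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `criticality` and `schedule` are hypothetical, the interconnect delay is modeled as `comm_cost` per data value (the text only says the delay is proportional to the data transferred), each resource runs one cluster at a time with no backfilling into idle gaps, and resource index 0 plays the role of the general-purpose processor (resource 1 in the text). The dependency matrix is assumed to be lower triangular, and all costs on resource 0 are assumed finite and positive so that parents always precede their children in the sorted list.

```python
import math


def criticality(p, d):
    """c_i = p_i on the general-purpose resource plus the largest criticality
    among the children of i; end nodes have c_e = p_e."""
    n = len(p)
    c = [0] * n
    for i in reversed(range(n)):        # children have larger indices than parents
        kids = [c[j] for j in range(i + 1, n) if d[j][i] > 0]
        c[i] = p[i][0] + max(kids, default=0)
    return c


def schedule(p, d, comm_cost=1.0):
    """p[i][r]: cost of cluster i on resource r (math.inf if not executable).
    Returns {cluster: (resource, start, finish)}."""
    n, m = len(p), len(p[0])
    c = criticality(p, d)
    free = [0.0] * m                    # time at which each resource becomes idle
    place = {}
    for i in sorted(range(n), key=lambda k: -c[k]):   # decreasing criticality
        best = None
        for r in range(m):
            ready = 0.0
            for j in range(i):
                if d[i][j] > 0:         # j is a parent of i
                    rj, _, fj = place[j]
                    transfer = comm_cost * d[i][j] if rj != r else 0.0
                    ready = max(ready, fj + transfer)
            start = max(ready, free[r])
            finish = start + p[i][r]
            if best is None or finish < best[2]:
                best = (r, start, finish)
        place[i] = best
        free[best[0]] = best[2]
    return place


# Two general-purpose processors (columns 0, 1) and one coprocessor (column 2)
# that halves the cost of cluster 0 and cannot run clusters 1 and 2.
p = [[10, 10, 5], [5, 5, math.inf], [7, 7, math.inf], [3, 3, 3]]
d = [[0, 0, 0, 0],
     [2, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 4, 1, 0]]
plan = schedule(p, d)
makespan = max(finish for _, _, finish in plan.values())
```

In this toy example the start cluster is placed on the coprocessor, the two independent clusters run in parallel on the two general-purpose processors, and the packet finishes at time 17, which is below the start node's criticality of 20.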
13.5.3 Mapping and Scheduling Results

We show the results of this mapping and scheduling algorithm in Figure 13.8. It uses the A20 ADAG for Flow Classification and an NP architecture with four processors: three general-purpose processors and one coprocessor that requires only half the instructions for some of the clusters (for illustration, these were picked randomly and do not reflect actual application behavior). The schedule in Figure 13.8 completes the processing of the packet at time 102. This is shorter than the original criticality of the start node due to the use of the coprocessor. Overall, it can be seen that the application parallelism is exploited and the processing resources of the network processor are used efficiently. A further exploration of this algorithm and its impact on a scenario in which the optimization criterion is system throughput is currently work in progress.
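The observation that the finish time can fall below the start node's criticality follows from a short calculation. The sketch below is an illustration under stated assumptions rather than a result from the chapter: $P$ denotes the critical path from the start node $s$, $S \subseteq P$ the nodes of $P$ placed on the coprocessor at half cost, and $\delta \ge 0$ the total interconnect delay incurred along $P$ (the symbols $P$, $S$, $\delta$, and $T_P$ are introduced here). It also assumes that no node on $P$ waits for a busy resource or for a parent outside $P$.

```latex
\begin{align*}
c_s &= \sum_{i \in P} p_{i1} \\
T_P &= \sum_{i \in P \setminus S} p_{i1} + \sum_{i \in S} \tfrac{1}{2}\, p_{i1} + \delta
     = c_s - \tfrac{1}{2} \sum_{i \in S} p_{i1} + \delta \\
T_P &< c_s \iff \delta < \tfrac{1}{2} \sum_{i \in S} p_{i1}
\end{align*}
```

In words, the time $T_P$ to traverse the critical path drops below the criticality $c_s$ whenever the interconnect delay is smaller than the processing time saved on the coprocessor. Once $P$ is shortened, another path may become the longest one, so this bounds the path $P$ only and not the overall finish time.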
[Figure 13.8: panel (a) criticality graph; panel (b) schedule on Processors 1 through 4 over a time axis from 0 to 100, with processing complete at time 102.]

FIGURE 13.8 Mapping and scheduling result for Flow Classification. The criticality graph shows the node name, the criticality $c_i$, and the processing cost vector for general-purpose processors and the coprocessor. The schedule shows which processing step is allocated to which processor and at what time the processing is performed.
13.6 CONCLUSIONS

In this chapter, we have introduced an annotated, directed, acyclic graph to represent application characteristics and dependencies in an architecture-independent fashion. We have developed a methodology to automatically derive this ADAG from run-time instruction traces that can be obtained easily from simulations. To consider the natural clustering of instructions within an application, we have used maximum local ratio cut (MLRC) to group instruction blocks and reduce the overall ADAG size. For four network processing applications, we have presented such ADAGs and shown how the inherent parallelism (multiprocessing or pipelining) can be observed. Using the ADAG representation, processing steps can be allocated to processing resources using a heuristic that uses the node criticality as a metric. We have presented one such mapping and scheduling result to show the heuristic's behavior. We believe this is an important step towards automatically analyzing applications and mapping processing tasks to heterogeneous network processor architectures.

For future work, we plan to further explore differences in the run-time execution of packets and how they affect the results of the analysis. We also want to compare the quality of the clustering obtained from maximum local ratio cut with that of other nongreedy ratio cut heuristics. Finally, it is necessary to develop a robust methodology for automatically identifying processing blocks for coprocessors and hardware accelerators.
REFERENCES

[1] K. B. Egevang and P. Francis, “The IP network address translator (NAT),” RFC 1631, Network Working Group, May 1994.
[2] J. C. Mogul, “Simple and flexible datagram access controls for UNIX-based gateways,” USENIX Conference Proceedings, pp. 203–221, Baltimore, Maryland, June 1989.
[3] G. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, and D. Saha, “Design, implementation and performance of a content-based switch,” Proceedings of IEEE INFOCOM 2000, pp. 1117–1126, Tel Aviv, Israel, March 2000.
[4] Hewlett-Packard Company, Maximizing HP StorageWorks NAS Performance and Efficiency with TCP/IP Offload Engine (TOE) Accelerated Adapters, March 2003, www.alacritech.com.
[5] Intel Corp, Intel IXP2800 Network Processor, 2002, developer.intel.com/design/network/products/npfamily/ixp2800.htm.
[6] J. Allen, B. Bass, C. Basso, R. Boivie, J. Calvignac, G. Davis, L. Frelechoux, M. Heddes, A. Herkersdorf, A. Kind, J. Logan, M. Peyravian, M. Rinaldi, R. Sabhikhi, M. Siegel, and M. Waldvogel, “IBM PowerNP network processor: Hardware, software, and applications,” IBM Journal of Research and Development, 47(2/3):177–194, 2003.
[7] EZchip Technologies Ltd., Yokneam, Israel, NP-1 10-Gigabit 7-Layer Network Processor, 2002, www.ezchip.com/html/pr_np-1.html.
[8] N. Shah, W. Plishker, and K. Keutzer, “NP-Click: A programming model for the Intel IXP1200,” Proceedings of Network Processor Workshop in Conjunction with Ninth International Symposium on High Performance Computer Architecture (HPCA-9), pp. 100–111, Anaheim, California, February 2003.
[9] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, “The Click modular router,” ACM Transactions on Computer Systems, 18(3):263–297, August 2000.
[10] G. Memik and W. H. Mangione-Smith, “NEPAL: A framework for efficiently structuring applications for network processors,” Proceedings of Network Processor Workshop in Conjunction with Ninth International Symposium on High Performance Computer Architecture (HPCA-9), pp. 122–124, Anaheim, California, February 2003.
[11] K. Taura and A. Chien, “A heuristic algorithm for mapping communicating tasks on heterogeneous resources,” Heterogeneous Computing Workshop, pp. 102–115, Cancun, Mexico, May 2000.
[12] J. A. Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE Transactions on Computers, C-30(7):478–490, July 1981.
[13] W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, “The superblock: An effective technique for VLIW and superscalar compilation,” The Journal of Supercomputing, 7(1–2):229–248, May 1993.
[14] B. A. Malloy, E. L. Lloyd, and M. L. Soffa, “Scheduling DAG’s for asynchronous multiprocessor execution,” IEEE Transactions on Parallel and Distributed Systems, 5(5):498–508, May 1994.
[15] Y.-K. Kwok and I. Ahmad, “Static scheduling algorithms for allocating directed task graphs to multiprocessors,” ACM Computing Surveys, 31(4):406–471, December 1999.
[16] H. El-Rewini and T. G. Lewis, “Scheduling parallel program tasks onto arbitrary target machines,” Journal of Parallel and Distributed Computing, 9(2):138–153, June 1990.
[17] C. Alpert and A. Kahng, “Recent directions in netlist partitioning: A survey,” Integration: The VLSI Journal, pp. 1–81, 1995.
[18] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel hypergraph partitioning: Application in VLSI domain,” Proceedings ACM/IEEE Design Automation Conference, pp. 526–529, Anaheim, California, June 1997.
[19] Y.-C. Wei and C.-K. Cheng, “Ratio cut partitioning for hierarchical designs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 10(7):911–921, July 1991.
[20] R. Ramaswamy and T. Wolf, “PacketBench: A tool for workload characterization of network processing,” Proceedings of IEEE 6th Annual Workshop on Workload Characterization (WWC-6), pp. 42–50, Austin, Texas, October 2003.
[21] D. Burger and T. Austin, “The SimpleScalar tool set version 2.0,” Computer Architecture News, 25(3):13–25, June 1997.
[22] T. Wolf and M. A. Franklin, “CommBench - A telecommunications benchmark for network processors,” Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 154–162, Austin, Texas, April 2000.
[23] ARM Ltd, ARM7 Datasheet, 2003.
[24] Intel Corp, Intel IXP1200 Network Processor, 2000, www.intel.com/design/network/products/npfamily/ixp1200.htm.
[25] TCPDUMP Repository, www.tcpdump.org, 2003.
[26] National Laboratory for Applied Network Research - Passive Measurement and Analysis, Passive Measurement and Analysis, 2003, www.pma.nlanr.net/PMA/.
[27] F. Baker, “Requirements for IP version 4 routers,” RFC 1812, Network Working Group, June 1995.
[28] NetBSD Project, NetBSD release 1.3.1, www.netbsd.org/.
[29] S. Nilsson and G. Karlsson, “IP-address lookup using LC-tries,” IEEE Journal on Selected Areas in Communications, 17(6):1083–1092, June 1999.
[30] S. Kent and R. Atkinson, “Security architecture for the internet protocol,” RFC 2401, Network Working Group, November 1998.
[31] J. Daemen and V. Rijmen, “The block cipher Rijndael,” Lecture Notes in Computer Science, volume 1820, pp. 288–296. Springer-Verlag, 2000.
[32] National Institute of Standards and Technology, Advanced Encryption Standard (AES), November 2001, FIPS 197.
[33] Network Processor Forum, Benchmarking Implementation Agreements, 2003, www.npforum.org/benchmarking/bia.shtml.
Index 3-adder configurations, 129 4-adder configurations, 129, 130 8b/10b coding/decoding, 60–61 abstraction layer, 166–67 AckNackGen instruction class, 73 ack/nack protocol, 60–61, 65 Ack/Nack Rx instruction class, 73 Ack/Nack Tx instruction class, 73 acknowledgement (ACK) headers, 90 ADAGs (annotated acyclic directed graphs), 281, 285, 287 clustering, 287 clustering using maximum local ratio cut, 287–91 clustering problem statement, 288–89 maximum local ratio cut, 290–91 MLRC complexity, 291 overview, 287–88 ratio cut, 289–90 mapping to NP architectures, 302–8 mapping algorithm, 303–4 mapping and scheduling results, 304–8 overview, 302 problem statement, 302–3 results, 291–301 applications, 294–95 basic block results, 295–96 clustering results, 296–98 identification of co-processor functions, 299–301 overview, 291 PacketBench tool, 291–94 address validation, 60–61, 74 Advanced Encryption Standard, see AES Advanced Switching version, 63 AES (Advanced Encryption Standard), 122–31, 237 design methodology and implementation details, 123–25 encryption algorithm, 130 experiments, 125–30 cluster statistics, 129–30 key agility, 127–29 mode of operation, 130
overview, 125 varying number of clusters, 129 varying stream size, 125–27 overview, 122–23 packet encryption, 120 performance summary, 130–31 pipelined implementation, 236–38 aggregate compiler, 151–52, 162–64 Aho-Corasick algorithm, 201, 205, 207–9, 213–14, 215–17 annotated acyclic directed graphs, see ADAGs (annotated acyclic directed graphs) application-specific instruction set, 76 application state, 158 architecture irregularity, 163–64 arithmetic clusters, 121 arithmetic operations, 69 ARM architecture, 293 Average trace, 134 backlog levels, 71 Baker language, 4, 150, 152–58 Base specification, 62–63 basic block, 287–88 Basic Local Alignment Search Tool (Blast), 198–99 Bell Labs Research Murray Hill NJ, 42 best-case execution times, 189 best-effort code, 10 bi-directional path, 108 binsearch program, 18, 19, 21, 24 Bioccelerator project, 198 bio-computing fields, 197 bit-level masking, 69 Blast (Basic Local Alignment Search Tool), 198–99 block cipher, 237 Bloom filter, 34, 36, 44, 50 board-level interconnect standards, 77 Boyer-Moore algorithm, 216 bpp methods, 250–51, 252–53, 259, 261 branch instructions, 70 ByteSub Transformation, 237
310 cable-modem termination systems (CMTS), 2 cacheable control store, 89 Calc CRC instruction class, 73 call-graph, 284 capturing network packets, 101–2 Carter-Wegman + Counter dual-use mode (CWC), 130 c-bit hash, 37 CFO (Control Flow Optimization), 164 check paint instruction class, 73 chip-level interconnect standards, 77 Chip MultiProcessors (CMPs), 3, 139, 219 chk_data program, 18, 19, 21, 24 cipher key, 237 classify instruction class, 73 Click elements, 70, 71, 73 Click Modular Router, 248 clock compensation, 60–61 clock recovery, 60–61 clustering algorithm, 281 CMOS technology, 279 CMPs (Chip MultiProcessors), 3, 139, 219 CMTS (cable-modem termination systems), 2 coarse-grain parallelism, 10 code generation (CG) component, 162 code memory, 69 code-reordering, 23 code-scheduling, 23 cold caching, 44 column caching, 29 communication interface, 59 compiler-generated multithreaded code, 93 compute-bound processes, 112 configuration space, 59, 60–61 control data flow graph, 71–72 control dependencies, 286 Control Flow Optimization (CFO), 164 control packets, 66 control plane interface, 55 control processor, 9 control stores, 11–12 co-processing identification, 301 co-processors, 55, 119 core clock, 87 core kernel, 124, 128 Counter Mode (CTR), 130 CPI (cycles per instruction), 95 CRC (cyclic redundancy check), 18, 19, 21, 24, 60–61, 63, 68, 74, 253
cryptography algorithms, 131 CSIX interface, 56 CTR (Counter Mode), 130 ctrl hasRoom (Tx) instruction class, 73 ctrl newCredit (Tx) instruction class, 73 custom simulator, 18 CWC (Carter-Wegman + Counter dual-use mode), 130 cycles per instruction (CPI), 95 cyclic data-flow graphs, 146 cyclic redundancy check (CRC), 18, 19, 21, 24, 60–61, 63, 68, 74, 253 DAG clustering, 287 data compression, 238–39 data dependencies, 285 data link layer (DLL), 58, 63 data packets, 66 data processors, 9 data transfer operations, 70 DC method, 238 “dead code”, 285 Defrag-1, 107 Defrag-2, 107 deframer, 63 deframer instruction class, 73 delay-optimization, 271–72, 273 dependency matrix, 288 Depth application, 140, 141 dequeue operations, 70 descr dequeue instruction class, 73 descr enqueue instruction class, 73 DES program, 18, 19, 21, 24 destination addresses, 77 Detection-1, 107, 112 Detection-2, 107, 113 detection step, 102 “dictionary”-based compression algorithm, 238 digital division, 16 DineroIV cache simulator, 17 direct-mapped associative segments, 28 direct-mapped cache, 14 direct memory access (DMA), 6, 82, 87–88, 90 divide-and-conquer methodology, of system design, 150 DLL (data link layer), 58, 63 DMA (direct memory access), 6, 82, 87–88, 90 DNA analysis algorithms, 198 DNAdb, 199, 204, 211
311 DNA nucleotides, 199 DNA processing fields, 204 DNA queries, using network processors for, 197–218 architecture, 198–210 Aho-Corasick algorithm, 207–9 hardware configuration, 201–3 nucleotide encoding, 209–10 overview, 198 scoring and aligning, 199–201 software architecture, 203–7 implementation details, 210–11 overview, 197–98 related work, 215–18 results, 211–15 domain-specific functionality, 176 domain-specific language, 62 doorbell descriptors, 90 double-buffered aging strategies, 44 downlink, 254 DRAM chips, 121 DSL access multiplexors (DSLAMs), 2 dst(h) functions, 268 dual frequency design, 87 dual-port memory, 71 d-way set-associative cache, 37 dynamic application analysis, 284 dynamic instruction, 19–20 EBO (Extended Basic-Block Optimization), 164 ECB (Electronic Codebook), 130 ECB mode, 140 Edge weights, 285 EFQ (Estimation-based Fair Queuing), 174 Electronic Codebook (ECB), 130 embedded devices, 173 end-to-end encrypted traffic, 100 energy optimization, 271–72, 273 enqueue operations, 70 EQ (exception/event queue), 87 Estimation-based Fair Queuing (EFQ), 174 exception/event queue (EQ), 87 execution mode, 120 exemplary physical (PHY) layers, 251 Extended Basic-Block Optimization (EBO), 164 fail message, 207 failure function ( f ), 207–8 fast-path functionality, 9
feedback-based encryption modes, 131 FFT program, 18, 19, 21, 24 fibcall program, 18, 19, 21, 24 Field-Programmable Gate Arrays (FPGAs), 104, 106, 109 FIFO buffer, 255 final_round kernels, 124–25 firewalls, 101 fixed-size control store, 9, 30 flits, 256–57 flow classification, 294–98, 301, 304 control, 60–61, 73, 77–78 defined, 221 identifiers, 35, 41 processing graph, 249 segments, 249, 269 flow ctrl instruction class, 73, 74 flow ctrl update (Rx) instruction class, 73 flow filter, 185 flow-level parallelism, 148 four-way set-associative cache, 44 four-way set-associative segments, 28 FPGA co-processor, 106, 109–10 FPGAs (Field-Programmable Gate Arrays), 104, 106, 109 FPL (Functional Programming Language), 158 fragmented memory hierarchy, 162–63 framer instruction class, 73 framing/deframing, 60–61 free-list memory space, 70 Functional Programming Language (FPL), 158 generalized processor sharing (GPS), 255 general-purpose registers (GPRs), 69 GigaNetIC project, 246, 247 global layers, 252, 262, 271 Gnu gdb, 17 goto function (g), 207–8 GPRs (general-purpose registers), 69 GPS (generalized processor sharing), 255 GPS scheduler, 259 GreedyPipe, 3, 220–21, 225–28, 242–43 basic idea, 225–26 NP example design results, 239–44 overall algorithm, 226–27 overview, 225 performance, 227–28 pipeline design with, 228–32
312 hardware multithreading, 148 hardware platform for network intrusion detection and prevention, 99–118 design rationales and principles, 100–104 characterization of NIDS components, 101–3 hardware architecture considerations, 103–4 motivation for hardware-based NNIDS, 100–101 overview, 100 evaluation and results, 110–18 functional verification, 111 micro-benchmarks, 111–14 overview, 110 system benchmarks, 114–18 overview, 99–100 prototype NNIDS on network interface, 104–10 hardware platform, 104–6 network interface to host, 107–9 overview, 104 pattern matching on FPGA co-processor, 109–10 reusable IXP libraries, 110 Snort hardware implementation, 106–7 hardware timers, 69 hashed flow identifiers, 37 hash tables, 35, 95 hasroom( ) method, 71 header fields, 69 header processing applications (HPA), 295 heterogeneous chip multiprocessors, 10 heterogeneous network processor architectures, 279–308 ADAG clustering using maximum local ratio cut, 287–91 clustering problem statement, 288–89 maximum local ratio cut, 290–91 MLRC complexity, 291 overview, 287–88 ratio cut, 289–90 ADAG results, 291–301 applications, 294–95 basic block results, 295–96 clustering results, 296–98 identification of co-processor functions, 299–301 overview, 291 PacketBench tool, 291–94
application analysis, 283–87 ADAGs, 285, 287 application parallelism and dependencies, 285–87 overview, 283 static versus dynamic analysis, 284–85 mapping application DAGS to NP architectures, 302–8 mapping algorithm, 303–4 mapping and scheduling results, 304–8 overview, 302 problem statement, 302–3 overview, 279–82 related work, 282–83 high-level stream program, 121 histograms, 74 home access router, 175 HPA (header processing applications), 295 Hypertransport, see RapidIO, Hypertransport, and PCI-Express (network processor interface for) IDS (intrusion detection system) decision engine, 104 ILP (integer linear program), 261 Imagine architecture, 122, 134, 136, 140, 141 Imagine Peak application, 140 implicit threading model, 156 inbound acknowledge packets, 65 inbound doorbell queue (DBQ), 87 inbound transactions, 63, 68 input packet bursts, 256 instruction cache, 10, 89 instruction fetch throttle, 17 instruction histograms, 74 instruction register (IR), 88 integer linear program (ILP), 261 Intel IXP 1200 processor, 105–6 Intel XScale® processor, 162 inter-actor communications conduits, 153 inter-cluster communication network, 128 intermediate representation (IR), 150 internal transmit buffers, 90 “Internet worm”, 197 inter-processor communication (IPC) mechanisms, 149 inter-thread conflicts, 27 intra-thread conflicts, 23 intrusion detection and prevention, see hardware platform for network intrusion detection and prevention
313 invalid bit sequence, 16 invalid route nodes, 233 I/O adapters, 57 I/O bandwidths, 10 IPC (inter-processor communication) mechanisms, 149 IP forwarding algorithm, 131–32 ipp methods, 250, 251, 252–53, 269 IPSec encryption, 295, 298 IPV4 forwarding, 120, 131–39 design methodology and implementation details, 132–34 experiments, 134–38 cluster statistics, 137–38 overview, 134 performance for real packet traces, 138 varying number of clusters, 136–37 varying size of input stream (buffer), 134–36 overview, 131–32 performance summary, 138–39 IPv4-radix application, 294, 298 IPv4-trier application, 294, 298 isort program, 18, 19, 21, 24 IXP2000 family micro-engine, 50 IXPBlast, 201, 204–5, 210–11 IXP libraries, reusable, 110 IXP simulator, 213 key_expansion kernel, 125, 128 key schedule, 237 lanes, 59, 60–61 ldi (load immediate) operation, 70 ldr (load word) operation, 70 leaf nodes, 233–34 LEDA simulator, 268 libpcap module, 106 link interfaces, 261, 265 link-level error recovery, 66 link-to-link communication protocols, 55 LM (local memory), 162–63 Load immediate (ldi) operation, 70 Load word (ldr) operation, 70 locality segment-sizing strategy, 20 locality strategy, 14, 19 local layers, 252, 261 local memory (LM), 162–63 Local Register File (LRF), 121 logical layers, 65
Logical operations (L), 69 logical ports, 261, 269 Longest Prefix Match (LPM), 221, 232, 233–36 long links, 254, 270–71 look-up-table (LUT), 255 loop optimizations, 164 LRF (Local Register File), 121 LRU set-associative cache, 35 ludcmp program, 18, 19, 21, 24 LUT (look-up-table), 255 LZW method, 238 machine-dependent optimizations, 162 machine-independent optimizations, 162 MAC (medium access control) layers, 251 MAE West Routing table, 134 maintenance threads, 93 main thread, 206 MANETs (mobile ad hoc networks), 246 matmul program, 18, 19, 22, 25 maximum local ratio cut (MLRC), 4, 288, 290, 296–98, 306 ME1 –ME5 micro-engines, 207, 210 ME code generator, 164 medium access control (MAC) layers, 251 memory interfaces, 55, 112, 275 memory-mapped window, 210 memory transfer registers, 51 MEs (micro-engines), 159, 201–2 message-passing semantics, 56 meta-data, 155 micro-engines (MEs), 159, 201–2 microflow, 179, 188 MixColumn Transformation, 237 mixed real-time workloads in multithreaded processors, supporting, see multithreaded processors, supporting mixed real-time workloads in MLRC (maximum local ratio cut), 4, 288, 290, 296–98, 306 mobile ad hoc networks (MANETs), 246 modulo operator, 16 molecular biologists, 198 MPEG2 application, 140 MTAP (Multi-Threaded Array Processing) architecture, 139 multi-level cache, 41 multi-level memory hierarchy, 149 multiple link interfaces, 265 multiple-pipeline environments, 3–4
314 multiprocessor SoCs, resource efficient network processing on, 245–77 design space exploration example, 268–77 application and system parameters, 268–71 overview, 268 results, 271–77 estimating resource consumption, 264–68 mapping application to system, 261–64 modeling packet-processing systems, 249–55 flow processing graph, 249–53 overview, 249 SoC architecture, 253–55 overview, 245–47 related work, 247–49 scheduling, 255–61 forwarding flow segments between PEs, 256–57 overview, 255–56 processing flow segments in PEs, 257–59 scheduling example, 259–61 Multi-Threaded Array Processing (MTAP) architecture, 139 multi-threaded packet-processing engine, 6 multithreaded processors, 3 multithreaded processors, supporting mixed real-time workloads in, 9–31 experimental evaluation, 17–29 benchmark programs and methodology, 17–18 overview, 17 profile-driven code scheduling to reduce misses, 23–25 segment sharing, 27–29 segment sizing, 18–22 sources of conflict misses, 22–23 using set-associativity to reduce misses, 25–27 future work, 30–31 instruction delivery in NP data processors, 11–13 fixed-size control store, 11–12 overview, 11 using cache as fixed-size control store, 12–13 overview, 9–11 related work, 29–30 segmented instruction cache, 13–17
address mapping, 16 enforcing instruction memory bandwidth limits, 17 implementation, 14–16 overview, 13 segment sizing strategies, 14 multithreading, 10, 258 MyD88 query, 214 Nack packet, 73–74 NASA Ames Internet exchange (AIX), 127 NAT (networkaddress translation), 279 NCL (Network Classification Language), 158 NEPAL framework, 249, 282 NetAMap (Network Application Mapper), 268 networkaddress translation (NAT), 279 Network Application Mapper (NetAMap), 268 Network Classification Language (NCL), 158 network interface card (NIC), 81, 101 network intrusion detection and prevention, see hardware platform for network intrusion detection and prevention network intrusion detection systems (NIDS), 99, 104–10 characterization of components, 101–3 hardware platform, 104–6 network interface to host, 107–9 overview, 104 pattern matching on FPGA co-processor, 109–10 reusable IXP libraries, 110 Snort hardware implementation, 106–7 network node intrusion detection system (NNIDS), 5, 99–101 Network-on-Chip (NoC), 247–48 Network Processing Forum (NPF), 139 network processors, 1–8 applications, 5–8 architecture, 3–4 overview, 1–3 tools and techniques, 4–5 network processor unit (NPU), 197, 204 network security, 7 network traffic processing, 284 NFA (non-deterministic finite automata), 110 NIC (network interface card), 81, 101 NoC (Network-on-Chip), 247–48
315 non-deterministic finite automata (NFA), 110 non-real-time threads, 3 NP architectures, mapping application DAGS to, 302–8 mapping algorithm, 303–4 mapping and scheduling results, 304–8 overview, 302 problem statement, 302–3 NP-Click, 282 NPF (Network Processing Forum), 139 NPU (network processor unit), 197, 204 NSS Group, 114 OCB (Offset Codebook) mode, 130 off-chip DRAM, 76 off-chip memory, 75, 219 off-chip SRAM, 75, 76 offload engine, 94 Offset Codebook mode (OCB), 130 on-chip RAM, 75 on-chip SRAM, 75 opp methods, 250, 251–52 OSI/ISO reference model, 57 outbound acknowledge packets, 63–65 outbound completion queue (CQ), 87 outbound transactions, 63, 66–68 packet assembly/disassembly, 60–61 PacketBench tool, 291–94 packet classification with digest caches, 33–52 chapter authors’ approach, 35–42 case for an approximate algorithm, 36 dimensioning digest cache, 37 exact classification with digest caches, 41–42 overview, 35–36 specific example of a digest cache, 39–41 theoretical comparison, 37–39 evaluation, 42–48 overview, 42–44 reference cache implementations, 44–45 results, 46–49 hardware overhead, 49–51 future designs, 50–51 IXP overhead, 49–50 overview, 49
overview, 33–34 related work, 34–35 packet data structures, 160 packet decoding, 102 packet descriptor, 70 packetized general processor sharing (PGPS), 256 packet-level parallelism, 106, 119 packet management functions, 293 packet-oriented communication protocols, 57 packet-processing engines, 68, 70 packet-processing functions (PPFs), 150, 153–54, 283 packet processing library (PPL), 253, 268 packet-processing modeling technique, 249 packet processing on SIMD stream processor, 119–44 AES encryption, 122–31 design methodology and implementation details, 123–25 experiments, 125–30 performance summary, 130–31 background: stream programs and architectures, 120–22 future work, 140–44 IPV4 forwarding, 131–39 design methodology and implementation details, 132–34 experiments, 134–38 overview, 131–32 performance summary, 138–39 overview, 119–20 related work, 139–40 packet-processing systems, programming environment for, 145–72 design details and challenges, 152–72 Baker language, 152–58 overview, 152 profile-guided, automated mapping compiler, 158–64 runtime system, 164–72 overview, 145–47 problem domain, 147–50 network processor and system architectures, 148–49 overview, 147 packet-processing applications, 147–48 solution requirements, 149–50 packet throughput, 224 packet traces, 158, 166
316 parameterizable network-processing unit, 246 pareto-optimal system, 273, 274 Partition Cycle Number, 232 partitioned register classes, 163 part route nodes, 233 path-thread, 190–92 pattern-matching unit, 110 payload dequeue instruction class, 73, 74 payload enqueue instruction class, 73, 74 payload processing applications (PPA), 295 payload queues, 70 PCI-Express, see RapidIO, Hypertransport, and PCI-Express (network processor interface for) PCI mezzanine connector (PMC), 104 peer-to-peer communication, 56 Pentium processor, 203 per-flow routing information, 39 PERFMON simulator, 268 PE(s,dst) functions, 268 PE(s,src) functions, 268 pfA packet flows, 252 PGPS (packetized general processor sharing), 256 PGPS scheduler, 259 physical interface, 59 Physical-Layer Initiative (UXPi), 57 physical layer (PHY), 56, 58, 65, 259 physical link, 59 physical ports (PPs), 255, 261, 262, 270 pipeline compiler, 159–61 pipeline task scheduling on network processors, 219–44 Greedypipe algorithm, 225–28 basic idea, 225–26 overall algorithm, 226–27 overview, 225 performance, 227–28 pipeline design with, 228–32 network processor problem, 232–44 AES encryption— pipelined implementation, 236–38 data compression— pipelined implementation, 238–39 Greedypipe NP example design results, 239–44 longest prefix matching (LPM), 233–36 overview, 232–33 overview, 219–21
pipeline task assignment problem, 221–24 notation and assignment constraints, 221–23 overview, 221 performance metrics, 223–24 related work, 224 PMC (PCI mezzanine connector), 104 Portland Research and Education Network (PREN), 42 PPA (payload processing applications), 295 PPFs (packet-processing functions), 150, 153–54, 283 PPL (packet processing library), 253, 268 PPs (physical ports), 255, 261, 262, 270 PREN (Portland Research and Education Network), 42 preprocessing, 102 private on-chip control store, 9 processing engine (PE), 253, 265–67, 269 processor simulator, 293 profile-guided, automated mapping compiler, 158–64 aggregate compiler, 162–64 overview, 158 pipeline compiler, 159–61 profiler, 158–59 profile strategy, 14, 19 profiling click elements, 73–74 program-sizing strategy, 18 protocol layers, 249, 272 qsort program, 18, 19, 22, 25 queue management, 70 queuing system, 102 qurt program, 18, 19, 22, 25 RAL (resource abstraction layer), 152, 166 Rambus, 57 RAMs (random access memories), 11 random number generator (RNG), 253 RapidIO, Hypertransport, and PCI-Express (network processor interface for), 55–80 architecture evaluation, 68–80 discussion, 76–80 mapping and implementation details, 70–71 micro-architecture model, 69 overview, 68–69 profiling procedure, 71–72 results, 72–76 simplified instruction set with timing, 69–70
317 common tasks, 59–68 Click for packet-based interfaces, 61–62 Hypertransport, 66–68 overview, 59–61 PCI Express, 62–65 RapidIO, 65 interface fundamentals and comparison, 57–59 common tasks, 59 functional layers, 57–59 overview, 57 system environment, 59 overview, 55–57 ratio cut, 288 real-time calculus, 4 Real-time Network Operating System, see RNOS real-time threads, 3 register aggregates classes, 163 register-to-register instruction, 69 resource abstraction layer (RAL), 152, 166 resource adaptation, 166–67 resource-awareness, 166 retransmission timeout (RTO) mode, 36 RISC processors, 201 RNG (random number generator), 253 RNOS (Real-time Network Operating System), 173–95 analysis model of, 175–87 application model, 176–78 calculus, 182–87 input model, 179–81 overview, 175–76 resource model, 181–82 implementation model of, 187–92 implementation, 189–92 overview, 187–88 path-threads, 188 scheduler, 188–89 measurements and comparison, 192–93 outlook, 193–95 overview, 173–74 scenario, 174–75 Round Key Addition, 237 round-robin scheduling policy, 28 round transformation, 237 route lookups, 33 RTO (retransmission timeout) mode, 36 RTP (Real-time Transport Protocol), 178 RTS (runtime system), 152
runtime adaptation, 167 runtime system (RTS), 152 SAHNE simulator, 253, 268 SAN (Storage Area Network), 123 S-box, 123 Scout OS, 173–74 scrambling/descrambling, 60–61 scratch memory, 202, 203 scratchpad, 49, 124, 128 SDKs (software development environments), 280 SDRAM, 49, 203 Sector-MAC layers, 272 security gateway router, 127 security policy, 101 segmented instruction caches, 10, 30 select program, 18, 19, 22, 25 sequence number (SN), 65 serialization/deserialization, 60–61 service level agreements (SLA), 179–80 set-associative digest cache, 42 set-associative hash tables, 35 SHA-1 hash, 44 Shangri-La environment, 4, 5, 150–52, 156–57, 158 shift/mask operations, 209 shiftRow Transformation, 237 SIMD (single-data stream), 3–4, 119–20, 141; see also packet processing on SIMD stream processor SimplePipe, 236 SimpleScalar, 238, 293 simultaneous multithreading (SMT), 139 single-chip multiprocessors, 279 single-data stream, see packet processing on SIMD stream processor; SIMD single-instruction stream, 3–4 site-specific configuration policy, 106 SLAs (service level agreements), 179–80 slow clock, 87 slow-path functionality, 9 SmartBits, 192 SMT (simultaneous multithreading), 139 Snort software, 5, 102, 106–7, 110, 216 SoCs (system-on-chips), see multiprocessor SoCs software development environments (SDKs), 280 special-purpose instructions, 90 spill thread, 206
SPI (System Packet Interfaces), 56
Sprint network router, 236
SRAM, 75, 76, 131–33, 203, 206
src(h) functions, 268
SRF (Stream Register File), 121
static application analysis, 284
static bottom level, 304
static instruction, 19–20
static profiling, 68, 71
Storage Area Network (SAN), 123
store-and-forward architecture, 128
Store word (str) operation, 70
stream-level flow diagram, 124
stream programming model, 141
Stream Register File (SRF), 121
string-matching algorithm, 207
striping/un-striping, 60–61
StrongARM processor, 104, 105, 107, 201, 203
sub-micron fabrication technologies, 248
switch boxes (SBs), 246, 247–48, 254, 265, 269, 270
switch fabric interface, 55
Synopsys Design Analyzer, 270
System & Circuit Technology research group, 246
system model, 152
system-on-chips, see multiprocessor SoCs
System Packet Interfaces (SPI), 56
system stress tests, 116
System SYS component, 269
TaFlTx (TX transaction layer), 63
task trees, 176–77
TCAM, 51
TCB cache, 90
TCB (transmission control block), 87
TCP connection context, 6
tcpdump format, 294
TCP/IP functions, 108
TCP/IP processing, 5–6, 90
TCP offload engine (TOE), 5–6, 81–98
    architecture of TOE solution, 87–94
        architecture details, 87–92
        overview, 87
        TCP-aware hardware multithreading and scheduling logic, 92–94
    inbound packets, 90
    instruction set, 92
    overview, 81–83
    performance analysis, 95–98
    requirements on TCP offload solution, 83–86
TCP SYN flags, 36
TCP (Transmission Control Protocol), 81
temporal conflict, 23
thrashing, 35
thread cache, 89
thread-switching, 93, 96
three-processor pipeline, 236
tight coupling, 77
TOE, see TCP offload engine (TOE)
transaction interface, 59
transaction layer (TA), 58
transmission control block (TCB), 87
Transmission Control Protocol (TCP), 81
transport layer, 65
tstdemo program, 18, 19, 22, 25
two-stage pipeline, 236
TX transaction layer (TaFlTx), 63
UDP header, 210
uniprocessor cache misses, 22
uni-processor simulator, 286
uni-processor systems, 281
unsecured information, 236
uplink, 254, 261
valid bit sequence, 16
valid route nodes, 233
variable-sized packets (Var), 126–27, 226
VHDL simulator, 268
virtual flow segments, 251
virtual private networks (VPNs), 279
Virtual Silicon memory, 271
VLIW clusters, 121
VLIW scheduler, 137
voice channels, 185
Voice over IP (VoIP), 174–75, 178
voip-data-mux task, 178
wire-speed processing, 82, 87
Workbench Simulator, 111
worst-case buffer, 272
worst-case execution time (WCET), 29, 189