CryptoGraphics Exploiting Graphics Cards for Security
Advances in Information Security
Sushil Jajodia, Consulting Editor
Center for Secure Information Systems, George Mason University, Fairfax, VA 22030-4444
email: jajodia@gmu.edu
The goals of the Springer International Series on ADVANCES IN INFORMATION SECURITY are, one, to establish the state of the art of, and set the course for future research in information security and, two, to serve as a central reference source for advanced and timely topics in information security research and development. The scope of this series includes all aspects of computer and network security and related areas such as fault tolerance and software assurance. ADVANCES IN INFORMATION SECURITY aims to publish thorough and cohesive overviews of specific topics in information security, as well as works that are larger in scope or that contain more detailed background information than can be accommodated in shorter survey articles. The series also serves as a forum for topics that may not have reached a level of maturity to warrant a comprehensive textbook treatment. Researchers, as well as developers, are encouraged to contact Professor Sushil Jajodia with ideas for books under this series.
Additional titles in the series:
UNDERSTANDING INTRUSION DETECTION THROUGH VISUALIZATION by Stefan Axelsson; ISBN-10: 0-387-27634-3
HOP INTEGRITY IN THE INTERNET by Chin-Tser Huang and Mohamed G. Gouda; ISBN-10: 0-387-22426-3
PRIVACY PRESERVING DATA MINING by Jaideep Vaidya, Chris Clifton and Michael Zhu; ISBN-10: 0-387-25886-8
BIOMETRIC USER AUTHENTICATION FOR IT SECURITY: From Fundamentals to Handwriting by Claus Vielhauer; ISBN-10: 0-387-26194-X
IMPACTS AND RISK ASSESSMENT OF TECHNOLOGY FOR INTERNET SECURITY: Enabled Information Small-Medium Enterprises (TEISMES) by Charles A. Shoniregun; ISBN-10: 0-387-24343-7
SECURITY IN E-LEARNING by Edgar R. Weippl; ISBN: 0-387-24341-0
IMAGE AND VIDEO ENCRYPTION: From Digital Rights Management to Secured Personal Communication by Andreas Uhl and Andreas Pommer; ISBN: 0-387-23402-0
INTRUSION DETECTION AND CORRELATION: Challenges and Solutions by Christopher Kruegel, Fredrik Valeur and Giovanni Vigna; ISBN: 0-387-23398-9
THE AUSTIN PROTOCOL COMPILER by Tommy M. McGuire and Mohamed G. Gouda; ISBN: 0-387-23227-3
Additional information about this series can be obtained from http://www.springeronline.com
CryptoGraphics Exploiting Graphics Cards for Security
by
Debra L. Cook Angelos D. Keromytis Columbia University New York, USA
Springer
Debra L. Cook Department of Computer Science 450 Computer Science Building Columbia University 1214 Amsterdam Avenue, M.C. 0401 New York, NY 10027-7003
Angelos D. Keromytis Department of Computer Science 450 Computer Science Building Columbia University 1214 Amsterdam Avenue, M.C. 0401 New York, NY 10027-7003
Library of Congress Control Number: 2006925092 CRYPTOGRAPHICS: Exploiting Graphics Cards for Security by Debra L. Cook and Angelos D. Keromytis ISBN-13: 978-0-387-29015-7 ISBN-10: 0-387-29015-X e-ISBN-13: 978-0-387-34189-7 e-ISBN-10: 0-387-34189-7 Printed on acid-free paper.
© 2006 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springer.com
Contents
List of Figures
List of Tables
Preface
Acknowledgments
1. INTRODUCTION
  1.1 Overview
  1.2 GPUs
  1.3 Motivation
  1.4 Encryption in GPUs
  1.5 Remotely Keyed CryptoGraphics
  1.6 Related Issues
  1.7 Extensions
  1.8 Conclusions
2. GRAPHICAL PROCESSING UNITS
  2.1 Overview
  2.2 GPU Architecture
  2.3 GPUs and General Purpose Programming
  2.4 APIs
  2.5 OpenGL and Pixel Processing
  2.6 Representing Data with Vertices
  2.7 Non-Graphic Uses of GPUs
3. MOTIVATION
  3.1 Overview
  3.2 Accelerating Cryptographic Processing
    3.2.1 Issue
    3.2.2 Previous Approaches
    3.2.3 Summary of the GPU-Based Approach
  3.3 Malware and Spyware
    3.3.1 Issue
    3.3.2 Motivating Applications
    3.3.3 Other Related Work
    3.3.4 Summary of the GPU-Based Approach
  3.4 Side Channel and Differential Fault Analysis
4. ENCRYPTION IN GPUS
  4.1 Overview
  4.2 Feasibility of Asymmetric Key Ciphers
  4.3 Feasibility of Symmetric Key Ciphers
  4.4 Modes of Encryption
  4.5 Example: AES
    4.5.1 AES Background
    4.5.2 AES in OpenGL
    4.5.3 AES Experiments
    4.5.4 Use of Parallel Processing in Attacks
  4.6 GPUs and Stream Ciphers
    4.6.1 Overview
    4.6.2 Experiments
  4.7 Conclusions
5. REMOTELY KEYED CRYPTOGRAPHICS
  5.1 Overview
  5.2 Keying of GPUs
  5.3 Prototype
    5.3.1 Purpose
    5.3.2 Architecture
    5.3.3 Implementation
  5.4 Design Decisions
    5.4.1 Remote Keying
    5.4.2 Decryption of Data in the GPU
  5.5 Experiments
  5.6 Conclusions
6. RELATED ISSUES
  6.1 Overview
  6.2 Protecting User Input
  6.3 Keying the GPU
  6.4 Attacks
  6.5 Trusted Platform Module
  6.6 Data Compression
7. EXTENSIONS
  7.1 Overview
  7.2 Graphics-based Cipher
  7.3 Encryption within DSPs
8. CONCLUSIONS
  8.1 Summary
  8.2 Suggested Projects
Appendices
A AES OpenGL Code for Encryption
  A.1 Overview
  A.2 Version Using the Red Pixel Component and the Back Buffer
  A.3 Version Using the RGB Pixel Components and the Front Buffer
References
Index
List of Figures
2.1 High Level View of GPU Hardware
2.2 GPU's Main Processing Steps
2.3 OpenGL Version 2.0 General Pipeline
2.4 OpenGL Pipeline for Pixel Processing
3.1 Various Attack Points for Phishing
4.1 ECB Encryption Mode
4.2 CBC Encryption Mode
4.3 CTR Encryption Mode
4.4 OFB Encryption Mode
4.5 CFB Encryption Mode
4.6 Layout of Data in Pixel Coordinates used in the OpenGL Version of AES
4.7 Encryption of 300 Identical Blocks in RGB Components
5.1 Malware on Untrusted Client with OS-based Decryption
5.2 Malware on Untrusted Client with GPU-based Decryption
5.3 Architecture for Remotely Keyed Decryption in the GPU
5.4 Remotely Keyed Decryption in GPU Protocol
5.5 Encrypted Image Received by GPU
5.6 Decrypted Image Displayed in GPU
5.7 Decryption Rates: All Entities on a Single System
5.8 Decryption Rates: Dedicated LAN and Client 1
5.9 Decryption Rates: Shared LAN and Client 2
6.1 Graphical Keypad for Digits
6.2 Graphical Keypad for Hex Values
List of Tables
4.1 AES S-Box for Encryption
4.2 AES S-Box for Decryption
4.3 Encryption Rates for AES
4.4 XOR Rate Using System Resources (CPU)
4.5 XOR Rate Using GPUs - RGB Pixel Components
4.6 XOR Rate Using GPUs - RGBA Pixel Components
Preface
CryptoGraphics: Exploiting Graphics Cards for Security explores the potential for implementing ciphers within graphics processing units (GPUs), and describes the relevance of GPU-based encryption and decryption to the security of applications involving remote displays. As the processing power of GPUs increases, researchers have started to study the use of GPUs for general purpose computing. While GPUs do not support the range of operations found in CPUs, their processing power has grown to exceed that of CPUs and their designs are evolving to increase their programmability. GPUs are especially attractive for applications requiring a large quantity of parallel processing. This work extends such research by considering the use of GPUs as a parallel processor for encrypting (and decrypting) data. The authors examine the operations found in symmetric and asymmetric key ciphers to determine if encryption can be programmed in existing GPUs. While certain operations make it impossible to implement some ciphers in a GPU, the operations used in most block ciphers, including the Advanced Encryption Standard (AES), can be performed in GPUs. A detailed description and code for a GPU-based implementation of AES is provided. The feasibility of GPU-based encryption allows the authors to explore the use of a GPU as a trusted system component, motivated by the use of thin-client and remote conferencing applications on untrusted or untrustworthy systems. By enabling encryption and decryption in GPUs, unencrypted display data can be confined to the GPU to avoid exposing it to any malware running on the operating system. The authors describe a prototype implementation of GPU-based decryption for protecting displays exported to untrusted clients. Issues and solutions related to fully securing data on untrusted clients, including the protection of user input, are also discussed.
Additional capabilities are constantly being added to GPUs: when the first experiments described in this book were performed, programmable pixel processors were a new feature. Improved programmability of GPUs will likely remove some of the limitations encountered when implementing ciphers to run in GPUs within the next couple of years, while other limitations are not likely to be addressed as long as GPUs are not designed or marketed for general purpose processing. While the capabilities of GPUs are growing, the concepts and proposed architectures described within this book are independent of the changes in GPUs and will only become easier to implement as the general programmability of GPUs evolves.
Acknowledgments
The authors jointly wish to thank John Ioannidis for suggesting the idea of performing encryption in a GPU, which led to this work, and Ricardo Baratto for providing information on thin clients. Eran Tromer pointed out that moving encryption into GPUs can be a preventive measure against some existing side channel attacks on block ciphers. Angelos Keromytis also wishes to thank his wife Elizabeth for her patience and understanding, as well as her careful reading of drafts of this manuscript.
Chapter 1 INTRODUCTION
1.1 Overview
The focus of this book is the use of graphics processing units (GPUs) for cryptographic operations, hence the term CryptoGraphics. The computing power of GPUs has increased substantially over the past several years to the point that GPUs are more efficient than CPUs for certain tasks. As a result, even though GPUs are not intended to be general purpose processors, researchers have begun to study the use of GPUs for non-graphics applications. In most cases, the goal is to increase the rate at which computations can be performed by an application by using the GPU for specific types of calculations. Applications that are well suited to run in a GPU use data representations and types that are compatible with the GPU's abstraction of pixels. Compatible computations involve operations that take a single pixel's value, apply a simple function to it and output the result as a new pixel value. Parallel processing on multiple data sets can be performed by using multiple sets of pixels to represent the data sets and applying the algorithm simultaneously to each set, and/or by treating each color component of a pixel as a separate set of data and applying the algorithm in parallel to each color component. The potential for increased processing power was the original reason for investigating the use of GPUs for cryptographic operations. As the work evolved, other benefits emerged, such as avoiding the exposure of unencrypted data to an untrusted operating system where spyware can access it, and designing ciphers based on operations commonly found in graphics processing. Another, less obvious, benefit is that executing cryptographic operations entirely in a GPU provides a preventive measure against some existing side channel attacks and differential fault analysis on ciphers.
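The pixel abstraction described above can be modeled on a CPU with a short sketch. The following Python fragment is only an illustration, not GPU code and not the book's OpenGL implementation: the names Pixel, apply_per_pixel and xor_with_mask are hypothetical, and XOR with a constant is chosen merely as an example of a "simple function". It treats the red, green and blue components as three independent data sets that are all transformed by the same per-component operation.

```python
# Illustrative CPU-side model of per-pixel processing (not actual GPU code).
# All names here (Pixel, apply_per_pixel, xor_with_mask) are hypothetical.
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), one byte per color component


def apply_per_pixel(image: List[Pixel], fn: Callable[[int], int]) -> List[Pixel]:
    """Apply the same simple function to every component of every pixel."""
    return [tuple(fn(c) & 0xFF for c in px) for px in image]


def xor_with_mask(mask: int) -> Callable[[int], int]:
    """Return a per-component function that XORs its input with a fixed byte."""
    return lambda c: c ^ mask


# Three independent data sets, one carried in each color component.
red = [0x00, 0x01, 0x02]
green = [0x10, 0x11, 0x12]
blue = [0x20, 0x21, 0x22]

image = list(zip(red, green, blue))          # one pixel per data position
result = apply_per_pixel(image, xor_with_mask(0xFF))

# Each data set is recovered by reading back a single color component.
red_out = [px[0] for px in result]
green_out = [px[1] for px in result]
blue_out = [px[2] for px in result]
```

On a real GPU the loop over pixels would be carried out by the pixel processors in parallel; the sketch only shows how data sets map onto pixels and color components.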
The work described within this book explores the possibility of implementing asymmetric key and symmetric key ciphers within GPUs, and describes the relevance of GPU-based encryption and decryption to applications involving remote displays, such as video conferencing and thin-client applications. An implementation of AES in OpenGL serves as an example of the feasibility of encrypting within a GPU. It also reflects the obstacles encountered due to limitations of GPUs and their APIs. A prototype application involving streaming video and GPU-based decryption is described to illustrate the benefits and issues of running a cipher within a GPU. Suggestions for GPU enhancements and a proposal for a GPU friendly cipher are included. In addition, methods for securing other data inputs relevant to the applications, such as keyboard input and audio, are briefly described. The relationship of this work to that of the Trusted Computing Group (TCG) is also discussed. GPU vendors are constantly increasing the capabilities of GPUs. When the first experiments described in this book were performed, programmable pixel (fragment) processors were just being added to GPUs. During the time this book was being written, the increase in supported pixel size has resulted in an increase in the amount of data that can be encrypted simultaneously, but no new capabilities became available to address the obstacles encountered when attempting to perform certain cryptographic operations within a GPU. In the next couple of years the growing programmability of GPUs and the introduction of an API that improves access to GPUs' capabilities will likely eliminate some of the obstacles encountered, but other limitations are not likely to be addressed as long as GPUs are not designed or marketed for general purpose processing. 
Chapter 2 provides background information on GPUs and their APIs, which will assist the reader in understanding the capabilities and limitations of using a GPU as a general purpose processor. The background information also clarifies why certain implementation decisions were made in the experiments described in Chapters 4 and 5. The motivation for the work is described in Chapter 3. The protection that GPU-based encryption and decryption provide against side channel attacks and differential fault analysis is also discussed. Chapter 4 discusses the implementation of encryption within a GPU, including an implementation of AES in OpenGL. The code for the OpenGL version of AES's encryption function is provided in Appendix A. Chapter 5 describes a prototype for encrypting displays sent to untrusted remote clients. Chapter 6 describes issues related to fully implementing a secure system based on the prototype described in Chapter 5. This includes protecting the user's inputs on the untrusted client, an option for conveying a secret key to the GPU, notes on compression of images, and the relevance of certain types of attacks to the prototype. An overview of the TCG's trusted platform module (TPM) and how GPU-based encryption can utilize the TPM is provided. Chapter 7 discusses related ideas and future work, including the encryption of audio in digital signal processors (DSPs) and designing a stream cipher to run in a GPU. Chapter 8 summarizes the work and the insights gained from the experiments. The following is an overview of each chapter.
1.2 GPUs
This chapter provides background information on GPUs and their APIs, which will assist the reader in understanding both the motivation for the work and the implementation decisions made in the experiments described in later chapters. An overview of GPUs is provided and existing APIs to GPUs are discussed. Of the APIs, the lowest level that is publicly available and is independent of the operating system is OpenGL [58]. Direct3D [51] is at the same layer as OpenGL but is Microsoft-specific.¹ The experiments and implementations described within this book use only the OpenGL API. Other, more user-friendly, APIs exist that provide a user interface layer above OpenGL and Direct3D. However, these provide the programmer less control over which operations are executed in the GPU vs. in the CPU, and over the exact commands issued to the GPU. Processing in GPUs is split between operating on vertices (vertex processor) and on pixels (pixel processor). The cryptographic operations under consideration require that data be stored in and processed as pixels as opposed to vertices. This is also the case for other types of applications that have experimented with using a GPU as a general purpose processor. An explanation for why vertices cannot be used to store and process data is provided. In order to provide an understanding of what operations are provided by a GPU for cryptographic algorithms, some details on OpenGL and pixel processing are included. Finally, a few non-graphic applications utilizing GPUs in areas other than cryptography and security are mentioned to illustrate the growing use of GPUs as general purpose processors.
1.3 Motivation
This chapter describes the motivation for experimenting with the use of GPUs for performing cryptographic operations. The main reasons are accelerating the execution of cryptographic operations using commodity hardware and protecting data from spyware and certain types of phishing attacks. The use of GPUs also prevents some existing side channel attacks and existing differential fault analysis.
Cryptographic operations serve a critical role in protecting data and in ensuring the authenticity and integrity of data. The need to perform such operations without consuming shared system resources in certain environments has led to the development of specialized cryptographic hardware. However, such hardware is not a common component of most systems. In contrast, GPUs are a widely available and fairly inexpensive resource. GPUs offer a high level of parallel processing and have processing speeds that exceed those of CPUs (although GPUs do not offer the general processing capabilities provided by CPUs). Therefore, GPUs may serve as a viable alternative to dedicated hardware for performing cryptographic operations.
Aside from leveraging the GPU's processing power, GPU-based encryption assists in protecting displays sent to untrusted systems. Software that covertly monitors user actions, also known as spyware, has become a first-level security threat due to its ubiquity and the difficulty of detecting and removing it. Such software may be inadvertently installed by a user who is casually browsing the web, or may be purposely installed by an attacker or even the owner of a system. This is particularly problematic in the case of utility computing, early manifestations of which are Internet cafes and thin-client computing. Chapter 5 examines the problem of protecting a user accessing specific services in such an environment. As our two example applications, the focus is on secure video broadcasts and remote desktop access when using any convenient (possibly untrusted) terminal. For such applications, confining the trusted computing base to a suitably modified GPU prevents spyware running on the operating system from accessing the displayed data. This involves moving image decryption into GPUs.
A final benefit of GPU-based decryption is that it prevents some existing side channel attacks and existing differential fault analysis. A summary of such attacks is provided. Moving encryption into the GPU prevents attacks that require access to memory used by the encryption algorithm, attacks that measure CPU usage, or attacks that require the ability to introduce specific faults into the software or hardware.
Existing results concerning the potential for attacks that measure acoustics or power utilization, or attacks that inject flaws into the hardware will not directly work on GPUs. Conceptually, some of these types of attacks can be applied to GPUs. Experimentation is needed to determine what types of measurements can be obtained from the GPU and provide useful information to an adversary.
1.4 Encryption in GPUs
This chapter discusses the feasibility of performing encryption within GPUs based on the types of operations and data structures supported. Both asymmetric and symmetric key ciphers are considered. A summary of common public key methods is provided along with an explanation as to why they are not suitable for implementation in existing GPUs. In contrast, common operations found in symmetric key ciphers are implementable within GPUs, with a few exceptions. To illustrate the potential for providing encryption within a GPU using a symmetric key cipher, an OpenGL implementation of AES is described in detail. The common modes of encryption for block ciphers and how they can be performed in a GPU while allowing for parallel processing of data are described. Finally, a partial implementation of a stream cipher within a GPU is discussed. This component of the work provides a basis for applying stream ciphers in GPUs.
1.5 Remotely Keyed CryptoGraphics
The applicability of GPU-based decryption to video broadcasts, such as found in desktop video conferencing applications, and remote desktop access, such as with thin-clients, is investigated. An architecture for providing GPU-based decryption for these applications is defined and a prototype for use with streaming video is described. In the prototype, a stream cipher is used for encrypting data at the server and decrypting data at the client. The secret key for the stream cipher must be known by both the server and the client's GPU. When performing decryption in a GPU, the issue of how to securely convey the secret key to the GPU must be addressed in order to avoid exposing plaintext to processes running on an untrusted operating system. Remote keying of the GPU is one solution. The secret key for the stream cipher is sent to the GPU via a proxy. A certificate stored in the GPU's memory contains a public/private key pair for the GPU. The proxy establishes a secure session with the server over which it receives the secret key. The proxy encrypts the secret key with the GPU's public key and sends it to the GPU via the client's operating system. The GPU decrypts the secret key and uses it for the stream cipher. Encrypted data is sent directly from the server to the client, where it is written to the GPU and XORed with the key stream. The purpose of the prototype is to simulate the target architecture. A few of the operations in the prototype had to be performed in the CPU instead of the GPU due to GPU limitations. The capabilities of current GPUs in supporting both decryption and the keying of a symmetric key cipher using existing key establishment protocols as well as GPU limitations in regards to such operations are identified. Enhancements to future GPUs are proposed that will allow the full realization of the defined architecture.
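The last step of this flow, XORing the received data with a key stream derived from the shared secret key, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the key stream generator below is a toy construction (SHA-256 applied to the key and a counter) chosen only because it needs nothing beyond the standard library; it is not the stream cipher used in the prototype, and the names keystream and xor_bytes are hypothetical. The public key transport of the secret key via the proxy is not modeled here.

```python
# Toy model of stream-cipher style encryption and decryption by XOR.
# Assumption: the keystream generator is a stand-in, not the book's cipher.
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudorandom bytes from `key` (illustrative only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(data: bytes, stream: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(d ^ s for d, s in zip(data, stream))


# Hypothetical secret key known to both the server and the client's GPU.
secret_key = b"secret key shared by server and GPU"

frame = bytes(range(64))                                   # server-side display data
ciphertext = xor_bytes(frame, keystream(secret_key, len(frame)))

# Client side: the same key stream is regenerated and XORed with the ciphertext.
recovered = xor_bytes(ciphertext, keystream(secret_key, len(ciphertext)))
```

Because XOR is its own inverse, the same routine serves for encryption at the server and decryption at the client, which is why the client only needs the secret key and the ability to XOR pixel data.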
1.6 Related Issues
This chapter contains solutions to issues relating to protecting user input on untrusted clients, an alternative to the method described in Chapter 5 for keying the GPU, a discussion of man-in-the-middle attacks and phishing attacks as they apply to the prototype described in Chapter 5, an overview of the TCG's TPM and a discussion on the compression of images. While GPU-based encryption protects data sent from a server to an untrusted client, applications deal with more than just display updates. In thin-client applications, the user will provide inputs via the mouse and keyboard on the untrusted client. These inputs must be sent to the server and may contain information that must be protected. An alternative to the remote keying of the GPU used in the prototype described in Chapter 5 is presented. The method involves the user selecting colors displayed by the GPU to input the key. The applicability of two common types of attacks to the scenario in the prototype is also considered. The prototype's susceptibility to man-in-the-middle attacks when using a proxy is evaluated. The potential for phishing attacks when using the architecture described in the prototype is discussed. An overview of the TCG's TPM is included in this chapter because it relates to the use of the GPU as a trusted component when using GPU-based decryption to protect displays. The use of the GPU as a trusted module can be incorporated into the TCG architecture and the TPM can be used to generate keys for the GPU. Another issue is the compression of images and video when decrypting in the GPU. Ideally, an encrypted image cannot be compressed since the encryption will result in a pseudorandom bit string representation of the image. Thus, images are compressed before being encrypted and must be decrypted then decompressed. When an image is decrypted in the GPU, it should not be written back to the operating system to allow for decompression but instead should be decompressed in the GPU.
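The ordering constraint on compression can be checked with a small experiment. The sketch below is an illustration under assumptions, not part of the prototype: it uses Python's zlib for compression and a toy SHA-256 based key stream as a stand-in for a real stream cipher, and the names keystream and xor_bytes are hypothetical. It compares how well a highly redundant "image" compresses before and after encryption, then runs the compress, encrypt, decrypt, decompress sequence.

```python
# Illustrative check that encryption must follow compression, not precede it.
# Assumptions: zlib as the compressor, a toy SHA-256 keystream as the cipher.
import hashlib
import zlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudorandom bytes from `key` (illustrative only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(data: bytes, stream: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(d ^ s for d, s in zip(data, stream))


key = b"demo key"
image = bytes([0x80]) * 4096            # a flat gray "image", highly redundant

# Compressing the plaintext works well; compressing ciphertext does not.
size_plain = len(zlib.compress(image))
encrypted_raw = xor_bytes(image, keystream(key, len(image)))
size_cipher = len(zlib.compress(encrypted_raw))

# Sender: compress, then encrypt. Receiver: decrypt, then decompress.
compressed = zlib.compress(image)
sent = xor_bytes(compressed, keystream(key, len(compressed)))
received = zlib.decompress(xor_bytes(sent, keystream(key, len(sent))))
```

In the architecture discussed here, the final two receiver steps would both need to run inside the GPU so that neither the compressed plaintext nor the decompressed image is exposed to the operating system.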
1.7 Extensions
This chapter presents extensions to the work and future areas of research. The concept of designing a cipher based on operations suitable for execution in a GPU is discussed. Ideas for how to create a GPU-based stream cipher are presented. A GPU-based cipher would not only be beneficial to applications requiring encrypted displays, but could also serve as a general purpose cipher in any system containing a GPU. The concept of encrypting and decrypting displays in a GPU to avoid exposing plaintext to the operating system can be extended to audio. The operations supported by programmable DSPs are typical of those supported by CPUs and those found in existing ciphers. This makes performing cryptographic processing of audio in a DSP substantially easier than performing such processing on images in GPUs.
1.8 Conclusions
This chapter provides a summary of the benefits and issues related to performing cryptographic operations in a GPU. Possible enhancements to GPUs that will assist in performing cryptographic operations are reviewed. A list of possible projects for students is included.
Notes
1 For its latest graphics cards, the X1K series, ATI has announced plans to provide a lower level API than OpenGL and Direct3D. The new API will provide more flexibility for programmers using GPUs as general purpose processors [59].
Chapter 2 GRAPHICAL PROCESSING UNITS
2.1 Overview
Knowledge of the operations supported by GPUs and how data is processed in GPUs is necessary in order to understand how GPUs can be leveraged for cryptographic processing and protecting data. This chapter provides an overview of GPUs and their APIs. While GPUs allow for significant levels of parallel processing, the capabilities supported for graphics processing do not allow for general purpose computing within a GPU equivalent to that of a CPU. GPU capabilities continue to expand and the APIs are evolving to improve programmers' access to these capabilities, some of which can potentially assist in performing cryptographic operations. For the most recent GPU capabilities, the reader should refer to vendors' GPU specifications.
This chapter is organized as follows: Section 2.2 provides a summary of the general architecture and capabilities of GPUs. The steps a GPU performs on vertices and pixels when creating an image are described. Section 2.3 provides an overview of the types of operations supported by GPUs and the limitations of GPUs when used for general purpose programming. Section 2.4 lists the common APIs available for GPUs and explains why lower level APIs are more suitable for general purpose programming of a GPU compared to higher level languages that are more user friendly. Data can be processed in GPUs as either vertices or as pixels. For cryptographic applications discussed in this book, the data must be represented as pixels. Section 2.5 describes how pixels are processed in a GPU, focusing on the operations relevant to creating ciphers that can execute within a GPU. Section 2.6 discusses why vertex processing is not appropriate for existing cryptographic algorithms. The idea of using GPUs for cryptographic processing arose in part because of the processing power of GPUs and a growing number of other applications experimenting with using GPUs in place of CPUs. A few examples of other non-graphics applications utilizing GPUs are listed in Section 2.7.
2.2 GPU Architecture
GPUs contain their own processors and memory. A GPU is connected to the system by either an AGP, PCI or PCI Express bus. The first programmable GPUs operated as a fixed pipeline. A program compiled on the CPU issued API commands to the GPU for execution. This allowed computationally expensive operations to be performed in the GPU to free up the CPU. When programming a GPU, operations are performed on either vertices or pixels. Vertices are specified as coordinates and are the most basic element for defining any line or object. A pixel is a string of bits interpreted according to a specific format to indicate which bits represent the red (R), green (G), blue (B) and alpha (A) components. A typical configuration uses 32-bit pixels with 8 bits for each of the components. In the past two years, the flexibility in programming GPUs has substantially increased with the addition of programmable vertex and pixel (fragment) units that allow for certain programs to execute on the GPU. The term "pixel processor" will be used throughout this chapter to refer to the pixel unit. Some GPU specifications, articles and graphics books use the term "fragment processor" exclusively while others use the term "pixel processor". Vertex and pixel programs are commonly referred to as vertex and pixel (or fragment) shaders. Graphics programming generally uses vertex processing, whereas non-graphics applications using GPUs typically require the pixel processor [62]. The basic architecture of a GPU is shown in Figure 2.1. The number of vertex and pixel processors, and the way the components handling the operations and memory outside of these processors are organized, will vary per graphics card. The main point to obtain from the figure is that the GPU contains a series of vertex processors and pixel processors working in parallel. The vertex data are received over the bus from the host processor, and are processed by the vertex shaders, which include a programmable unit.
Some fixed steps are then performed, including rasterization, the output of which is the fragments given to the pixel processors. Both programmable and fixed steps are performed during pixel processing. The vertex processing steps are not applicable when operating on pixels directly. Bytes stored in the system's memory can be written directly to the GPU's memory and then processed as pixels. "Fixed" steps refer to steps outside the programmable units. These steps are controlled to some extent by the programmer. For example, the programmer defines stencils and depth, and parameters for the viewing angle and perspective, among others. Common components found within the vertex processor are a floating point unit, a floating point vector unit, a unit for fetching textures from the GPU's cache and a branch unit. One or more units for vertex assembly operations and viewport (mapping a 3D scene to the 2D viewing area) may be included.
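To make the 32-bit pixel format concrete, the following Python sketch shows how a run of bytes from system memory could be viewed as RGBA pixels with 8 bits per component. It is an illustration only: the function names are hypothetical, the zero padding of a short final pixel is a choice made for the example, and placing R in the most significant byte is an assumption, since the actual component order depends on the pixel format selected through the API.

```python
# Illustrative mapping between raw bytes and 32-bit RGBA pixels.
# Assumptions: 8 bits per component, R in the most significant byte.
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]


def bytes_to_rgba_pixels(data: bytes) -> List[RGBA]:
    """Group bytes four at a time into (R, G, B, A) tuples, zero-padding the tail."""
    padded = data + b"\x00" * (-len(data) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def pack_pixel(r: int, g: int, b: int, a: int) -> int:
    """Pack four 8-bit components into one 32-bit integer."""
    return (r << 24) | (g << 16) | (b << 8) | a


def unpack_pixel(value: int) -> RGBA:
    """Split a 32-bit integer back into its four 8-bit components."""
    return ((value >> 24) & 0xFF, (value >> 16) & 0xFF,
            (value >> 8) & 0xFF, value & 0xFF)


# A 16-byte block (the AES block size) occupies four RGBA pixels.
block = bytes(range(16))
pixels = bytes_to_rgba_pixels(block)
packed = [pack_pixel(*px) for px in pixels]
```

With this layout a single 32-bit pixel carries four bytes of data, so the number of bytes processed per pixel operation depends directly on how many color components are used.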
Figure 2.1. High Level View of GPU Hardware (VS = vertex shader, PS = pixel shader). Components shown: host system; vertex shaders (VS); culling, clipping, transformations; rasterization; texture cache; pixel shaders (PS); zcull; DRAM (partitioned memory).
Components found within the pixel processor include a texture processing unit that communicates with the cache, one or more floating point units, a branch unit and a fog arithmetic and logic unit (ALU). The processing speeds of GPUs have been increasing at a faster rate than those of CPUs. In the last two to three years, GPUs have evolved to contain more transistors than typical desktop CPUs. Although processing speeds of GPUs now surpass those of CPUs, their capabilities are narrower in scope than those of CPUs and do not offer the general programmability provided by the latter. This is due to both API limitations and GPU capabilities. Newer GPUs process at rates exceeding 40 billion floating point operations per second (GFlops). For example, peak performance of an Nvidia GeForce 6800 Ultra was listed as 40 GFlops in comparison to 6 GFlops on a 3.2 GHz Pentium 4 [57]. The RAM in GPUs is of smaller capacity than that commonly found in systems today, with newer GPUs containing a maximum of 256 MB or 512 MB of RAM compared to the 1 GB to 4 GB of RAM available for typical desktop PCs. However, as the number of transistors per GPU (or CPU) increases, power consumption and heat dissipation become of greater concern.
Until the year 2005, most GPUs supported 32-bit pixel formats and 32-bit floating point precision while others only supported 16-bit precision. Recently, support for 64-bit pixel formats and 64-bit floating point precision has become more common. Graphics cards with 128-bit floating point precision are becoming available.
Diagram labels for Figure 2.2: Host System (inputs); fetch; pack/unpack pixels; pixel processing commands/program; Vertex Processing (vertex program, transformations, lighting, clipping, projection, viewpoint); Texture Memory; Rasterization; Pixel Processing (z-culling, fragment program, tests, blending, logical operations).
Figure 2.2. GPU's Main Processing Steps
The general flow for processing data in a GPU is shown in Figure 2.2. The general flow in OpenGL 2.0, a platform independent API for GPUs, is shown in Figure 2.3 from the OpenGL Version 2.0 Specification [75]. It is important to understand both the fixed pixel processing pipeline as well as the flow with programmable units because not all operations have been moved into the programmable units. For example, the rasterization and blending steps are outside the programmable units and most of the pixel processing in OpenGL still corresponds to the basic pipeline [39]. The implementation of the block cipher AES described in Chapter 4 uses the basic pixel processing of GPUs.
Figure 2.3. OpenGL Version 2.0 General Pipeline. Stages shown: Display List; Evaluator; Per Vertex Operations and Primitive Assembly; Pixel Operations; Rasterization; Texture Memory; Per Fragment Operations; Framebuffer.
GPUs can be viewed as processing data in two formats. The first and most used in graphics applications is vertex processing. Vertices are specified as sets of coordinates. Any shape or object is formed by a set of connected vertices. Once objects are defined, transformations concerning properties such as the angle and direction the scene is viewed from, lighting and intensity are applied. The resulting scene is converted into fragments (pixels) and undergoes pixel processing before being displayed. The coordinates and properties of vertices, including color and location, cannot be tracked and read back to system memory as data. As a result, vertices are not a suitable means for representing data to which cryptographic operations are applied with the intent of offloading work from the CPU then supplying the result to a process running on the operating system. Even when the intent is to decrypt data in the GPU and display it to the user, with no need to transfer the data back to the operating system, vertex processing cannot be used because of the floating point representation and rounding. The rounding in GPUs, even when considering the increasing precision of 64 and 128-bit floating point values, results in a lack of accuracy unacceptable for ciphers where changing one bit will produce an incorrect decryption.
Processing vertex data involves the following steps (some of these steps may be performed within the vertex program instead of the traditional pipeline):
• The various data needed to construct the image (vertices, including their coordinates and colors, properties, parameters and any textures) are defined. The data is passed into the GPU through API commands.
• Transformations are applied to set the vertices in the scene. The vertex coordinates (including depth) are multiplied by model and view transformations. The model transformation indicates any rotation, translation or scaling of the scene; for example, rotating about the X axis by 30 degrees and doubling the scale of an object. The view transformation is the angle (or camera position) the scene is viewed from.
• Lighting is applied. This sets the angle and intensity of the light.
• Clipping, projection and viewpoint are applied. Clipping removes areas outside the scene. Projection can be thought of as viewing the scene through a normal, telephoto or wide angle camera lens. It also determines if all objects appear to be of the same size or if objects that are further away are smaller than those at the front of the scene (as objects appear in real life). The viewpoint defines the shape and area of the screen where the objects will appear.
• Rasterization, the converting of vertices into fragments (pixels), is performed. Texture coordinates are interpolated from the texture coordinates of the vertices.
• The fragments resulting from rasterization are tested to determine which pixels to keep and which pixels to discard. The scissor test discards portions of the image outside of a defined region. The alpha test discards pixels based on their alpha values. The stencil test discards portions of the image outside of a defined stencil. The depth test discards pixels based on their depth. When a pixel program is applied, fragments that will fail the depth test are discarded before the pixel program is applied. The depth test and the other tests are applied after the pixel program.
• The pixels are combined with the current contents of the buffer. The default setting is for the new pixels to overwrite the current pixels in the buffer. The pixels may instead be combined in a few ways. The current and new values may be combined by blending. The resulting value of each color component is based on both the new and current values; for example, by multiplying both the old and new value by some factors then adding the result. By default, no blending is performed. Dithering may be applied. This averages pixels with neighboring pixels to eliminate abrupt color changes. By default, dithering occurs, but can be disabled by an API command. Logical operations can also be applied, such as XORing the new and current pixel values together. Logical operations are off by default.
A vertex program can replace the model and view transformations, and any per-vertex lighting. The vertex program may define textures and their coordinates. Vertex programs work on a single vertex at a time, with the output continuing through the remainder of the standard pipeline. Operations that require knowledge of multiple vertices and/or of topology are performed according to the standard pipeline. A vertex program can read textures but cannot currently read from the framebuffer.
The second method for processing data in GPUs is pixel processing. Individual pixel values can be set and operated on, as opposed to drawing and manipulating objects. Pixels can be used to store and manipulate byte level data, as described later in this chapter. Pixel values can be transferred between the GPU's framebuffer and system memory (where they are stored as bytes) by executing commands from a program running on the CPU. Therefore, an application using a GPU to offload processing from the CPU can read the result from the GPU to use in a program running on the CPU. Pixel processing is described in more detail in Section 2.5. In Figure 2.2, the vertex processing steps are not applicable when dealing solely with pixels. The program executing on the system's CPU will write pixels to the framebuffer and possibly define textures, then the pixel processing steps will be performed.
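The three ways of combining incoming pixels with the buffer contents (overwrite, blending and a logical operation) can be sketched on 8-bit components as follows. This is a plain Python simulation, not OpenGL code. The blending factors of 0.5 and the truncation to an integer are assumptions chosen only for illustration.

```python
def combine(dst, src, mode="replace", src_factor=0.5, dst_factor=0.5):
    """Combine incoming 8-bit components (src) with those in the buffer (dst)."""
    if mode == "replace":   # default: the new pixel overwrites the old one
        return list(src)
    if mode == "blend":     # weighted sum of new and current values, clamped
        return [min(255, int(s * src_factor + d * dst_factor))
                for s, d in zip(src, dst)]
    if mode == "xor":       # logical operation applied bit by bit
        return [s ^ d for s, d in zip(src, dst)]
    raise ValueError(mode)


framebuffer = [0x0F, 0xF0, 0xAA, 0x55]   # current RGBA contents
incoming = [0xFF, 0xFF, 0x0F, 0x55]      # RGBA values being written
xored = combine(framebuffer, incoming, "xor")
undone = combine(xored, incoming, "xor")  # XORing twice restores the buffer
```

The XOR mode is the one of interest for byte-level data, since it is exact and self-inverse, whereas blending generally alters the stored byte values.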
2.3
GPUs and General Purpose Programming
The following is an overview of the types of operations supported by GPUs and the limitations of GPUs when used for general purpose programming. The types of applications best suited for GPUs are those that involve operations that take a single pixel's value, apply a function to it (with limitations on what the function can be) and output the result as a new pixel value. Parallel processing of data is performed by using multiple pixels and multiple color components of a pixel. Four streams of data can be operated on simultaneously in the GPU by using each of the four components of a pixel (red, green, blue and alpha, or RGBA). As a general rule, data that is stored in an array when programming in the CPU should be represented as a texture in the GPU. Any loop running in the CPU should have the inside of the loop run as a kernel on the pixel processor. In some cases, complex functions that cannot be performed in the GPU can be computed in advance in the CPU and the results stored as tables (e.g., as colormaps) or textures to be used by the GPU. In order to use table lookups in the GPU to represent a function, the function must take only a single input value and the input value must fit in a color component of a pixel. If the function takes multiple inputs, it may not be possible to represent it as a table lookup on pixel values. Applications that take multiple inputs and produce a single output; applications that require pixels be processed in a particular order; or applications that require using one pixel's value to determine which operation to perform on another pixel are not suitable for implementation in a GPU. Visiting pixels in a particular order in a single pass through the pixel processor is not possible because there is no way to control the order in which pixels are processed. It is also not possible to use the results from an already processed pixel when operating on a pixel that has yet to be processed.
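The programming model described above (an array becomes a texture, the loop body becomes a kernel, a single-input function becomes a table) can be mimicked in a few lines. The sketch below is a CPU-side simulation only. The function stored in the table, x*x mod 256, is an arbitrary stand-in, and the names TABLE, kernel and run_pass are hypothetical.

```python
# Single-input function precomputed on the CPU as a 256-entry table.
# The input must fit in one 8-bit color component.
TABLE = [(x * x) % 256 for x in range(256)]


def kernel(pixel):
    """Loop body applied to one RGBA pixel; the four components are
    four independent data streams."""
    return tuple(TABLE[component] for component in pixel)


def run_pass(texture):
    """Apply the kernel to every pixel. No pixel depends on another
    pixel's result, so the processing order does not matter."""
    return [kernel(pixel) for pixel in texture]


texture = [(1, 2, 3, 4), (10, 20, 30, 40)]   # the "array" held as a texture
result = run_pass(texture)
```

Reversing the order in which the pixels are visited gives the same per-pixel results, which is the property a single pass through the pixel processor relies on.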
Pixel processors can currently perform what is referred to as scatter. Scatter is the capability to output results to areas of the image other than those used as the input to the function. However, pixel processors cannot support memory accesses such as a[i] = x where i is a computed address. The operation x = a[i] is possible. If a is a texture and i is a computed value, then a[i] is a texture fetch instruction, whereas a[i] = x requires a texture write instruction to a computed address, i. This is because in pixel processors the only writes allowed are to pre-computed fragment addresses that cannot be changed by a program running in the pixel processor. The GPU is also designed to only read texture data, whereas a CPU is designed for read and write operations. This limits how data can be processed in a GPU compared to a CPU. A way around this is to write intermediate results to system memory then read the data back into the GPU. This increases the number of data transfers between the GPU and operating system, increasing the overall execution time. Furthermore, it makes intermediate results available to the operating system, which must be avoided for the applications addressed in this book. A second option is to use the vertex processor, which supports such indexing. This is unsuitable for applications whose data cannot be represented and processed as vertices, including cryptographic processing. Branching is also not readily supported in pixel processors, although a workaround (described by Pharr [62]) can exist for certain applications. The GPU's pixel processor is still often a single instruction, multiple data (SIMD) design without support for branching or with minimal support where both paths of the branch must be taken, slowing processing. In contrast, vertex processors are now multiple instruction, multiple data (MIMD) processors and support branching.
Some GPUs, such as Nvidia's GeForce 6 Series, support different segments of the frame taking different paths. One segment is processed at a time, causing the other to wait. Processing after the branch does not begin until both segments have finished the branch. On Nvidia's FX Series, branching is supported by evaluating both possible paths then only writing the results of the path actually taken. Another limitation is that GPUs treat all values as floating point values, including the values of pixel components. This must be carefully taken into account when the data being processed involves values that are to be interpreted as individual bits. The floating point representation limits the range of integers supported and introduces rounding error. The OpenGL version of AES in Appendix A illustrates the impact of rounding error, which has to be considered when populating the tables used in the implementation. While the OpenGL shading language supports an integer data type, this is done for the benefit of the programmer. The values are actually stored and processed as floating point values in the hardware. In the OpenGL shading language, integers are currently limited to 16 bits plus a sign bit.
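Two of the restrictions discussed above can be made concrete with a short simulation. The first part contrasts a read from a computed address, x = a[i], with a write to a computed address, a[i] = x, and shows that the write can be rephrased as a read when the destination mapping is known in advance and can be inverted (an assumption that does not hold in general). The second part mimics branching by evaluating both paths and keeping one result. All names are hypothetical and nothing here is GPU code.

```python
def gather_pass(texture, source_of):
    """Each output position p reads texture[source_of(p)], i.e. x = a[i]."""
    return [texture[source_of(p)] for p in range(len(texture))]


def scatter_reference(values, dest_of):
    """CPU-side reference for a[i] = x with a computed address i."""
    out = [None] * len(values)
    for p, v in enumerate(values):
        out[dest_of(p)] = v
    return out


def both_paths(values, threshold):
    """Evaluate both sides of a branch for every value, then keep one."""
    out = []
    for v in values:
        if_true = (v * 2) % 256          # path A, always computed
        if_false = v ^ 0xFF              # path B, always computed
        keep_true = 1 if v < threshold else 0
        out.append(keep_true * if_true + (1 - keep_true) * if_false)
    return out


data = [10, 20, 30, 40]
n = len(data)
scattered = scatter_reference(data, lambda p: (p + 1) % n)  # element p goes to p + 1
gathered = gather_pass(data, lambda p: (p - 1) % n)         # position p reads p - 1
selected = both_paths([10, 200], threshold=128)
```

In both_paths the cost of the branch is the sum of the two paths for every value, which is why this form of branching slows processing.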
The time to read and write data to and from the GPU must be considered. The processing power of GPUs has increased faster than the data transfer rates between the system's memory and the GPU. Therefore, it is best to limit reads and writes to system memory. When the time to transfer data between system memory and the GPU is considered, functions that can be computed faster in the GPU may take longer than when computed in the CPU. In general, functions that have a large ratio of arithmetic operations to memory accesses may perform better on a GPU (provided the arithmetic can be done on the GPU) than on a CPU, whereas those that require a large number of memory accesses will likely be slower on the GPU. Note that neither symmetric key ciphers nor asymmetric key ciphers fall into the category of having a large ratio of arithmetic operations to memory accesses. Symmetric key ciphers have simple operations repeatedly applied to the data. Asymmetric key ciphers have a few arithmetic operations requiring large data structures; for example, exponentiation involving large integers.
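The trade-off between transfer time and arithmetic can be illustrated with a deliberately crude cost model. The processing rates below reuse the peak figures quoted earlier in this chapter (40 GFlops and 6 GFlops); the bus rate of 1e9 bytes per second and the workload sizes are assumptions made up for the example, and real code rarely reaches peak rates.

```python
def gpu_time(n_bytes, ops_per_byte, gpu_ops_per_s, bus_bytes_per_s):
    """Seconds to send data to the GPU, process it, and read it back."""
    transfer = 2 * n_bytes / bus_bytes_per_s        # to the GPU and back
    compute = n_bytes * ops_per_byte / gpu_ops_per_s
    return transfer + compute


def cpu_time(n_bytes, ops_per_byte, cpu_ops_per_s):
    """Seconds to process the same data without leaving system memory."""
    return n_bytes * ops_per_byte / cpu_ops_per_s


GPU_RATE = 40e9   # peak rate quoted in the text
CPU_RATE = 6e9    # peak rate quoted in the text
BUS_RATE = 1e9    # assumed transfer rate, illustrative only
N = 1_000_000     # bytes processed

light_gpu = gpu_time(N, 4, GPU_RATE, BUS_RATE)      # few operations per byte
light_cpu = cpu_time(N, 4, CPU_RATE)
heavy_gpu = gpu_time(N, 100, GPU_RATE, BUS_RATE)    # many operations per byte
heavy_cpu = cpu_time(N, 100, CPU_RATE)
```

Under these assumed numbers the transfer dominates when only a few operations are applied per byte, so the CPU finishes first, while the GPU wins once the arithmetic per byte is large.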
2.4
APIs
The two most common APIs for GPUs are OpenGL and Direct3D. OpenGL is an open source, platform independent API. In contrast, Direct3D is specific to Microsoft Windows. These APIs are the lowest level, publicly available interfaces to GPUs. There are higher level languages built on top of OpenGL and Direct3D that provide a more user-friendly syntax and hide lower level details from the programmer, but such languages provide no additional capabilities in terms of what commands can be executed in the GPU since they rely on the OpenGL and Direct3D APIs. What can be executed within a GPU is restricted to the capabilities of the GPU, which are independent of the level or type of API used. The higher level languages result in code that compiles to a combination of a program (usually C or C++ code) that executes in the CPU and issues commands to the GPU. Such languages do not allow the developer control over which commands the code is translated into or even which commands are executed in the GPU. For example, code in a higher level language that XORs two bytes will likely be transformed into code executed in the operating system rather than converted into OpenGL commands that convert the bytes to pixels and XOR the pixels. Using pixels to XOR bytes produces the desired result but is an inefficient way to XOR only two bytes when the operation can easily be performed in the CPU. Using the GPU to XOR two long sequences of bytes in a single step is useful. Cg [25], BrookGPU [8] and Vertigo [10] are some examples of higher level languages, of which Cg is the oldest and the most well-known. Cg has a C-like syntax and compiles to either OpenGL or Direct3D code, depending on the platform. The Cg code must be included in a main program that compiles on the CPU, such as a C or C++
of the window system and providing a more user-friendly syntax for creating display windows than the APIs for the window systems. GLUT is closed source. Its executable is available from the OpenGL organization at: http://www.opengl.org/resources/libraries/glut.html. There are several alternatives to GLUT, including open source versions such as Freeglut. A list of toolkits that provide wrappers for window systems' APIs, along with links to their downloads, is available at: http://www.opengl.org/resources/libraries/windowtoolkits.html. The experiments described within this book required using a low level API in order to issue commands directly to the GPU and required platform independence. Therefore, OpenGL was used in all experiments. GLUT was used to open the display windows. Further details regarding OpenGL pixel processing and vertex processing that are relevant to implementing ciphers within GPUs are provided in the next two sections. At the time this was being written, ATI announced plans to provide support for general purpose GPU programming by publishing an API that is at a lower level than OpenGL and Direct3D in order to provide more direct access to the GPU's capabilities [59].
2.5
OpenGL and Pixel Processing
The following is an overview of the OpenGL pixel processing pipeline and the OpenGL commands relevant to the experiments described in subsequent chapters. The implementations used in the experiments process data as 32-bit pixels treated as floating point values, with one byte of data stored in each pixel component. When using 32 bit pixels, 1 byte is typically dedicated to each of the RGBA components. Other formats, such as 10 bits for each of the red, green and blue components and 2 bits for the alpha component may also be supported. Since the time of the experiments, support for 64-bit pixels with 16 bits for each of the color components has become available. The following capabilities are not used in the experiments described in this book and therefore, are not described here: OpenGL's capabilities of processing pixels as color and stencil indices, and OpenGL's vertex processing (refer to [58] and [89] for a complete description). Figure 2.4 shows the components of the OpenGL pipeline that are relevant to pixel processing when pixels are treated as floating point values. While implementations are not required to adhere to the pipeline, it serves as a general guideline for how data is processed. The programmable pixel processor replaces part of the pipeline. Pixel shaders can access and apply textures, compute and set colors and depth, and apply fog. As with vertex shaders, the various tests at the end of the pipeline (scissor, stencil, alpha, etc.) are performed according to the pipeline and are not programmed within the pixel processor. OpenGL requires support for at least a front buffer (image is visible) and a back buffer (image is not visible) but does not require support for the alpha pixel component in the back buffer. This limits
Figure 2.4. OpenGL Pipeline for Pixel Processing. Labels in the diagram: System Memory; Pixel Storage Modes; Unpack; Pack; Convert to [0,1]; Convert Luminance to RGBA; Convert to Luminance (if required); Pixel Transfer Operations and Map; RGBA, depth; Texture Memory; Per Fragment Operations. The legend distinguishes the system to framebuffer direction from the framebuffer to system direction.
the data representation to be three bytes per pixel (the red, green, blue components) when performing operations in the back buffer. It is worth mentioning that while a 32-bit pixel format is used in the implementations described in Chapters 4 and 5, the 32 bits cannot be operated on as a single 32-bit value, but rather is interpreted in terms of pixel components. For example, it is not possible to add or multiply two 32-bit integers by representing them as 32-bit pixels. A data format indicating such items as number of bits per pixel and the ordering of color components specifies how the GPU interprets and packs/unpacks the bits when reading data to and from system memory. The data format may indicate that the pixels are to be treated as floating point numbers, color indices, or stencil indices. The following description concerns the floating point interpretation. When reading data from system memory, the data is unpacked and converted into floating point values in the range [0,1]. Luminance, scaling and bias are applied per color component. The next step is to apply the colormap, which we describe later in more detail. The values of the color components are then clamped to be within the range [0,1]. Rasterization is the conversion of data into fragments, with each fragment corresponding to a pixel in the framebuffer. In work described here, this step
has no impact. The fragment operations relevant to pixel processing include dithering, threshold-based tests (such as discarding pixels based on the alpha value and on stencils), and blending and logical operations. These operations combine pixels being drawn into the frame buffer with those already in the destination area of the frame buffer. Dithering, which is enabled by default, must be turned off when storing data in pixels to prevent pixels from being averaged with their neighbors and their values changed as a result. When reading data from the framebuffer to system memory, the pixel values are mapped to the range [0,1]. Scaling, bias, and colormaps are applied to each of the RGBA components and the result clamped to the range [0,1]. The components or luminance are then packed into system memory according to the format specified. When copying pixels between areas of the framebuffer, the processing occurs as if the pixels were being read back to system memory, except that the data is written to the new location in the framebuffer according to the format specified for reading pixels from system memory to the GPU. Aside from reading the input from system memory and writing the result to system memory, the OpenGL commands in the implementations described in Chapters 4 and 5 consist of copying pixels between coordinates, with colormaps and a logical operation of XOR enabled or disabled as needed. Unfortunately, the copying of pixels and colormaps are two of the slowest operations to perform [89]. The logical operation of XOR produces a bitwise-XOR between the pixel being copied and the pixel currently in the destination of the copy, with the result being written to the destination of the copy. A colormap is applied to a particular component of a pixel when the pixel is copied from one coordinate to another. A colormap can be enabled individually for each of the RGBA components. The colormap is a static table of floating point numbers between 0 and 1.
Internal to the GPU, the value of the pixel component being mapped is converted to an integer value that is used as the index into the table and the pixel component is replaced with the value from the table. For example, if the table consists of 256 entries, as in the AES implementation described in Chapter 4, and the map is being applied to the red component of a 32-bit pixel with 8 bits per color component, the 8 bits of the red value are treated as an integer between 0 and 255, and the red value updated with the corresponding entry from the table. In order to implement the tables used in the OpenGL version of AES as colormaps, the tables must be converted to tables of floating point numbers between 0 and 1, and hard-coded in the program as constants. The table entries, which would vary from 0 to 255 if the bytes were in integer format, are converted to floating point values by dividing by 255. Because pixels are stored as floating point numbers and the values are truncated when they are converted to integers to index into a colormap, 0.000001 is added to the result (except to 0 and 1) to prevent errors due to truncation.
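The table conversion just described can be sketched as follows. This is a toy model rather than the implementation from Appendix A: it assumes single precision storage (simulated with the struct module), assumes the conversion back to an integer is a truncation of the component multiplied by 255, and uses an arbitrary stand-in table instead of the AES tables. Every name in it is hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float to single precision, as a stand-in for GPU storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


def table_entry(byte_value, epsilon=0.000001):
    """Convert a table byte to [0, 1]; the offset is skipped for 0 and 255."""
    value = byte_value / 255.0
    if byte_value not in (0, 255):
        value += epsilon
    return to_float32(value)


def to_index(component):
    """Truncate a [0, 1] component to an integer in the range 0..255."""
    return int(component * 255.0)


# Arbitrary 256-entry stand-in table (not an AES table).
BYTE_TABLE = [(7 * b + 3) % 256 for b in range(256)]
COLORMAP = [table_entry(v) for v in BYTE_TABLE]


def apply_colormap(byte_value):
    """Look up a component in the colormap and convert the result to a byte."""
    return to_index(COLORMAP[byte_value])


# Number of byte values that would come back wrong in this toy model
# if the offset were omitted.
mismatches_without_offset = sum(
    to_index(to_float32(v / 255.0)) != v for v in range(256)
)
```

With the offset, every lookup in this model returns the intended byte, because the stored value lies slightly above the exact quotient and truncation therefore cannot fall to the next lower integer.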
The use of floating point numbers for pixels is one example of where GPUs do not readily provide a capability required for cryptographic processing, namely the need to maintain the accuracy of byte values with no change due to rounding or truncation. There is no support for user specified conditional statements based on specific pixel values. For example, there is no way to say
    pixel = pixel at x,y coordinates 80,90
    if (pixel's blue value == 01010101) {
        turn on colormap 1;
    } else {
        turn on colormap 2;
    }
and have this code executed entirely in the GPU. The pixel's value has to be read to the operating system's memory, the comparison performed using the CPU and system memory, then a command issued to the GPU to turn on the appropriate colormap.
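The workaround described above, in which the pixel is read back and the comparison is made on the CPU, can be mimicked with a toy model. The ToyGPU class and its methods are hypothetical stand-ins invented for this sketch; they do not correspond to OpenGL calls.

```python
class ToyGPU:
    """Hypothetical stand-in for a GPU: a framebuffer plus a selectable colormap."""

    def __init__(self, framebuffer, colormaps):
        self.framebuffer = framebuffer   # maps (x, y) to an (R, G, B, A) tuple
        self.colormaps = colormaps
        self.active = None
        self.readbacks = 0

    def read_pixel(self, x, y):
        """Copy one pixel to system memory; every call crosses the bus."""
        self.readbacks += 1
        return self.framebuffer[(x, y)]

    def enable_colormap(self, name):
        """Command issued from the CPU to select a colormap."""
        if name not in self.colormaps:
            raise KeyError(name)
        self.active = name


identity = list(range(256))
gpu = ToyGPU({(80, 90): (0, 0, 0b01010101, 255)},
             {"colormap1": identity, "colormap2": identity[::-1]})

# The comparison runs on the CPU, so the pixel value must leave the GPU first.
red, green, blue, alpha = gpu.read_pixel(80, 90)
gpu.enable_colormap("colormap1" if blue == 0b01010101 else "colormap2")
```

The readback counter makes the cost visible: each conditional of this kind adds a transfer and exposes the pixel value to system memory.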
2.6
Representing Data with Vertices
In contrast to pixels, vertices cannot be used to store data. Intuitively, if the data being operated on is used to define vertex coordinates and the operations performed on the data result in altering the coordinates of vertices, the resulting coordinates of the vertices represent the outcome of the computation. However, there is no support for tracking individual vertices and reading their coordinates back to system memory. A vertex is defined by its coordinates, thus these must be known. A vertex cannot be referenced by some other name and the coordinates obtained as if they were a property of a vertex object. Another obstacle is that there is no means by which to define conditional statements based on the coordinates of a vertex. It may be thought that encryption of data can be performed by defining a vertex for a segment of data, such as a bit or byte, with the color of the vertex representing the data. Then a vertex program performs a series of transformations that alter each vertex's color and possibly its location, with the new color (which is just a pixel value) for each vertex representing the encrypted data. Decryption will require reversing the transformation or repeating the transformation in the case of a stream cipher. However, existing ciphers cannot be mapped to such a transformation. Rounding during the transformation (recall that all values are of floating point type) will result in a lack of precision. Just consider the vertex coordinates. The transformation must ensure that calculations never result in a coordinate value that falls between two pixels to avoid nondeterministic results.
For example, a vertex with an x coordinate computed to be 100.499 by one GPU and 100.500 by another will impact the pixel with coordinate 100 in the first display and 101 in the second display if the GPUs round to the nearest coordinate. The values 100.999 and 101.000 will map to x coordinates of 100 and 101, respectively, if the GPUs truncate the floating point values when determining the coordinates. In summary, a program cannot create an image defined by vertices at specific locations, perform transformations on the image that alter the locations of the vertices, then read properties of a particular vertex because the coordinates of that vertex are no longer known. Specifically, if a program creates a vertex v at coordinates (x1, y1, z1), there is no general way to track the movement of v and at the end of the program read v's new coordinates, (x2, y2, z2), or color. (A vertex can be tracked when using only simple transformations, such as rotating a square 90 degrees.)
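The coordinate example above is easy to check numerically. The two helper functions are hypothetical; rounding to nearest is written as floor(x + 0.5) because Python's built-in round resolves ties to the even integer, which is not the behavior assumed in the text.

```python
import math


def round_to_nearest(x):
    """Map a coordinate to the nearest pixel, with halves rounded up."""
    return math.floor(x + 0.5)


def truncate(x):
    """Map a coordinate to a pixel by discarding the fractional part."""
    return int(x)


rounded = [round_to_nearest(100.499), round_to_nearest(100.500)]
truncated = [truncate(100.999), truncate(101.000)]
```

In both cases a difference of 0.001 in the computed coordinate moves the vertex to a different pixel.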
2.7
Non-Graphic Uses of GPUs
As the processing capabilities of GPUs increase, the idea of using GPUs for non-graphic applications is becoming common. The General Purpose Computation on GPUs (GPGPU) organization catalogs experiments that use GPUs for general purpose computing. The GPGPU website is located at http://www.gpgpu.org/. Nvidia's GPU Gems 2 [62] devoted a section to general purpose computing using GPUs. Applications range from natural extensions of graphics programming that use graphical simulations for physical processes, such as particle flows, and visualizations, such as 3D representations of protein structures, to scientific computing and basic mathematical algorithms. Scientific computing applications include parallelizable algorithms used in genetic research [92] and options pricing [62]. Standard mathematical problems include solving systems of linear equations [26], fast Fourier transforms, the Floyd-Warshall algorithm, and sorting [62]. When the purpose of using the GPU is to obtain faster processing as opposed to isolating data from programs running on the operating system, applications can split processing between the CPU, system memory and the GPU to work around GPU limitations. How the work is split between the system's CPU and memory, and the GPU must be considered when using a combination of the system's and GPU's resources. The number of data transfers between the GPU's and system's memory, and how often the GPU and/or CPU must wait for the other to complete an operation before proceeding can negate any benefit the GPU provides for the steps it performs. A loop that runs in the CPU and maintains a counter or performs conditional tests, but whose main body executes in the GPU is an example of how to split work when conditional statements are needed that cannot be performed in the GPU. However, the conditional tests must at most involve transferring a small amount of data from
the GPU to the system's memory. An example of this is one of the sorting programs mentioned in [62]. In contrast, a program that performs most of a loop within the GPU but needs to transfer all data back to the system's memory in order to perform part of the loop may result in the data transfers decreasing or eliminating the performance gained from using the GPU. For cryptographic processing, it is acceptable if operations that do not involve any intermediate results are performed on the CPU. For example, when implementing a block cipher with r rounds, the loop maintaining the round counter can run on the CPU while the body of the loop, which is the round function, executes in the GPU.
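The division of labor described above, with the round counter on the CPU and the round function in the GPU, can be sketched with a toy cipher. The round function below (XOR with a round key followed by a table lookup) is an illustrative stand-in that offers no security and is not AES; the tables, keys and function names are all hypothetical, and the "GPU" functions are ordinary Python.

```python
ROUNDS = 4

# Toy substitution table (a permutation of 0..255, not an AES S-box).
TABLE = [(5 * b + 1) % 256 for b in range(256)]
INV_TABLE = [0] * 256
for index, value in enumerate(TABLE):
    INV_TABLE[value] = index

# Toy round keys, one 4-byte key per round (one byte per RGBA component).
ROUND_KEYS = [bytes((17 * r + i) % 256 for i in range(4)) for r in range(ROUNDS)]


def gpu_round(state, round_key):
    """Round body: XOR with the round key, then a lookup per component."""
    return [TABLE[s ^ k] for s, k in zip(state, round_key)]


def gpu_inverse_round(state, round_key):
    """Inverse round body: inverse lookup, then XOR with the round key."""
    return [INV_TABLE[s] ^ k for s, k in zip(state, round_key)]


def encrypt(block):
    state = list(block)                  # written to the GPU once
    for r in range(ROUNDS):              # the counter is kept on the CPU
        state = gpu_round(state, ROUND_KEYS[r])   # body issued as GPU work
    return bytes(state)                  # read back once at the end


def decrypt(block):
    state = list(block)
    for r in reversed(range(ROUNDS)):
        state = gpu_inverse_round(state, ROUND_KEYS[r])
    return bytes(state)


plaintext = b"\x00\x01\xfe\xff"
ciphertext = encrypt(plaintext)
```

Only the loop counter is handled by the CPU; the intermediate state of each round never needs to return to system memory.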
Chapter 3 MOTIVATION
3.1
Overview
In this chapter, the main reasons behind using GPUs to perform cryptographic operations are discussed. Specifically, these are (a) accelerating common cryptographic operations with commodity hardware (i.e., without using dedicated extension cards), (b) protecting against some types of phishing or monitoring attacks, and (c) eliminating some types of side channel analysis and differential fault analysis that can reveal sensitive information (such as private cryptographic keys). For each of the first two reasons, the problem is stated, related work and previous approaches are described, and the GPU-based approach is summarized. The prevention of certain side channel and differential fault attacks is a side benefit of GPU-based encryption. These types of attacks are described and their applicability, or lack thereof, to GPU-based ciphers is discussed.
3.2
Accelerating Cryptographic Processing
3.2.1
Issue
In a large-scale distributed environment such as the Internet, cryptographic protocols and mechanisms play an important role in ensuring the safety and integrity of the interconnected systems and the resources that are available through them. The fundamental building blocks such protocols depend on are cryptographic primitives, whose algorithmic complexity often turns them into a real or perceived performance bottleneck. To address this issue, vendors have been marketing hardware cryptographic accelerators that implement such algorithms [24, 32, 42, 45, 50]. Others have experimented with taking advantage of special functions available in some CPUs, such as MMX instructions [4]. In some ways, this trend mirrors the evolution of high-performance GPUs to match the needs of the computer-gaming community. Note that, despite the increasing
performance of CPUs, GPUs are considered essential for "serious" gaming (or other graphics-intensive applications), both because they often outperform the system processor and, perhaps more importantly, because the CPU can be used to complete other tasks while the GPU is handling the graphics-rendering part of the application.
3.2.2 Previous Approaches
The early work on cryptographic accelerators was characterized by a focus on the hardware accelerator rather than its implications for overall system performance. The work described in [80] examines cryptographic subsystem issues in the context of securing high-speed networks, and observed that the bus-attached cards would be limited by bus-sharing with a network adapter on systems with a single I/O bus. A second issue pointed out in that time frame is the cost of system calls [63]. A third issue is the cost of buffer copying [22, 35, 71, 79]. These issues still exist, and continue to require aggressive design to reduce their impacts. The OpenBSD Cryptographic Framework (OCF) [38] offers a general interface to hardware cryptographic accelerators that can be used to improve the overall performance of such protocols. As interest in security is currently in an upswing, recent work has been examining the overall performance impact of security technologies in real systems. Work by Coarfa, et al. [15] examined the impact of hardware accelerators in the context of TLS-enabled web servers using a trace-based methodology, and concludes that there is some opportunity for acceleration, but given the choice one might prefer a second processor as it also assists with the substantial (and perhaps dominant) non-cryptographic overheads. Miltchev, et al. [53] provide some basic performance characterizations of IPsec as well as other network security protocols, and the impact acceleration has on throughput. The authors conclude that the relative cost of high-grade cryptography is low enough that it should be the default configuration. In [28], the authors examine the benefits of using elliptic curve-based public key cryptosystems, which they show can improve HTTPS performance by 13%-30% in realistic workloads, with a greater benefit possible as servers move to larger key sizes. A hardware architecture for accelerating elliptic curve operations is presented in [77]. 
Boneh and Shacham describe a technique for improving SSL handshake performance in [7]. The technique demonstrates that it is faster to perform n SSL handshakes as a batch than n handshakes individually, based on a method for batching RSA decryptions, and shows a speedup factor of 2.5 for n = 4. It is important to note that this speedup only applies to the handshake portion of the SSL connection, not to the data transport itself. By caching session keys, Goldberg et al. [27] demonstrate a reduction in download time of secure web documents of between 15% and 50%. Again, this technique only accelerates the handshake portion of the SSL connection, without reducing the data transport time.
While the performance improvement that can be derived from accelerators is significant [38], only a relatively small number of systems employ such dedicated hardware. The approach CryptoGraphics takes is to exploit resources typically available in most systems. The majority of systems, in particular workstations and laptops, but also servers, include a high-performance GPU (also known as a graphics accelerator). Due to intense competition and considerable demand (primarily from the gaming community) for high-performance graphics, such GPUs pack more transistors than the CPUs found in the same PC enclosure, at a lower price [47]. GPUs provide parallel processing of large quantities of data relative to what can be provided by a general-purpose CPU. Performance levels exceeding the processing speed of a 10 GHz Pentium processor have been reached, and GPUs from Nvidia and ATI are functioning as co-processors to CPUs in various graphics subsystems [47]. GPUs are already being used for non-graphics applications, but presently none are oriented towards security [59, 82].
3.2.3 Summary of the GPU-Based Approach
The goal is to utilize the GPU to offload work from the system's CPU and to increase the rate at which data can be encrypted and decrypted by applying a cipher to a large amount of data simultaneously. The CryptoGraphics work consists of determining which existing cryptographic operations are suitable for implementation within GPUs and of experiments utilizing GPUs to run symmetric key ciphers. First, the feasibility of executing existing symmetric and asymmetric key ciphers in GPUs is evaluated. Second, a symmetric key cipher, AES, is implemented to run in GPUs. This demonstrates the feasibility of GPU-based encryption and the ability to encrypt a large number of data blocks simultaneously using the parallel processing of a GPU. Third, the use of GPUs for applying stream ciphers is investigated. GPUs serve as an easy-to-use mechanism for applying a key stream to a large segment of data at once. The work with existing ciphers illustrates why certain ciphers are implementable within a GPU and others are not. Specific byte-level operations and substantial byte-level manipulation found in symmetric key ciphers pose obstacles when implementing some of the algorithms. Some of the operations can be performed by a series of steps within the GPU, for example using tables for byte-level shifts, while other operations cannot be implemented in a GPU, for example rotations across bits stored in multiple pixels. Asymmetric key ciphers, in general, involve operations and/or require data types that are not presently supported in GPUs, specifically modular arithmetic using large integers.
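The idea of applying a key stream to a large segment of data at once can be illustrated on the CPU with a minimal Python sketch. It is not GPU code and not the implementation described in this book: `toy_keystream` is a hypothetical generator (SHAKE-256 used as an expandable output function) that merely stands in for a real stream cipher, and `apply_keystream` is an illustrative name. The point is only that the XOR step treats the whole buffer as one wide operand, with no dependency between positions.

```python
import hashlib


def toy_keystream(key: bytes, length: int) -> bytes:
    """Hypothetical key stream generator, for illustration only.

    SHAKE-256 is used here simply because it can emit any number of bytes;
    it is a stand-in for the key stream of a real stream cipher.
    """
    return hashlib.shake_256(key).digest(length)


def apply_keystream(data: bytes, keystream: bytes) -> bytes:
    """XOR an entire buffer with the key stream in a single wide operation."""
    n = len(data)
    if len(keystream) < n:
        raise ValueError("key stream is shorter than the data")
    mixed = int.from_bytes(data, "big") ^ int.from_bytes(keystream[:n], "big")
    return mixed.to_bytes(n, "big")


plaintext = b"pixel data " * 1000
keystream = toy_keystream(b"demo key", len(plaintext))
ciphertext = apply_keystream(plaintext, keystream)
# Applying the same key stream a second time restores the plaintext.
recovered = apply_keystream(ciphertext, keystream)
```

Because XOR is its own inverse, the same function serves for both encryption and decryption, which is what makes this step a natural fit for a data-parallel processor.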
3.3 Malware and Spyware
3.3.1 Issue
Spyware has been recognized as a major threat to user privacy [9, 87]. Especially when combined with a large-scale distribution mechanism (such as a popular web site or application, or a computer worm), the potential for large-scale security violations is considerable. Organizations increasingly spy on their employees' computer activities using the same technology. Furthermore, public computers in Internet cafes are so riddled with such malware that only the most foolhardy of souls would use them for any sensitive application. Phishing or other snooping attacks, whose purpose is to extract confidential information from a user, come in many forms. One way of categorizing such attacks that is relevant to this discussion is based on the means by which information is extracted, as shown in Figure 3.1. At a very coarse level of granularity, these are: (a) extract (harvest) information from the filesystem or other storage device (e.g., search the cached web pages for information looking like bank account numbers); (b) extract information from the operating system or other processes as the user is interacting with these (e.g., intercept user keystrokes, read the display framebuffer, or read another process's address space as the user is typing confidential information); (c) redirect connections to a fake web site (e.g., by subverting the Domain Name System (DNS) resolution or changing web proxy settings) to which the user provides the confidential information; (d) present false but difficult to verify information that misleads the user into actively contacting a fake (decoy) web site.
3.3.2 Motivating Applications
Applications of GPU-based decryption include remote desktops (a thin-client scenario) and video conferencing displays. In a thin-client scenario, the client connects to a server that fulfills all of the client's computing needs [54]. Since all application logic is executed in the server, the client is completely stateless, and does little more than display updates sent by the server and forward local user input events. Current thin-client systems provide secure sessions by encrypting the display protocol before it is transferred over the network. However, in scenarios where the client terminal is untrusted, such as public computers, it may not be desirable for the host operating system to have access to the unencrypted display updates. For example, consider the system presented by Koller et al. [41]. In this case, access to sensitive 3D data was controlled by manipulating the content sent to the remote display client. However, since the display data on the client could not be secured, a number of additional mechanisms had to be devised to prevent the actual client application from being used as an attack tool on the system. On the other hand, if the display is only in decrypted form
[Figure 3.1: diagram with the labels User, Network, Local Storage, Keystroke logger, Rootkits, Application monitoring, Network eavesdropping and redirection, User Interface attacks, Compromised application (e.g., malicious plugin), and Information harvesting.]
Figure 3.1. Various Attack Points for Phishing: Methods by which sensitive information can be extracted involve information harvesting in the filesystem, eavesdropping on user actions (application monitoring, keystroke logger, network eavesdropping), presenting false information (UI attacks), and redirecting the user to decoy sites.
within the GPU, only reads of the current display by other applications need to be blocked. In video conferencing, it is desirable to prevent clients from copying the conference displays. How to secure video and audio that is recorded at the client is also an issue. The concept demonstrated with GPUs can be applied to digital cameras and digital signal processors to encrypt video and audio before it reaches the system's memory. Similar concerns about copying, intercepting or modifying data arise in handling voice traffic, as noted by Walsh and Kuhn [87]. While there are existing digital rights management (DRM) architectures aimed at preventing unauthorized copying of video, the images are still decrypted within the remote and untrusted operating system. DRM includes how to manage the usage and trade of material [64] and must protect against both unauthorized access and unauthorized copying. An example is Microsoft's Windows Media Player DRM 9 Series, which includes the capability of authenticating and remotely-keying the media player [52]. The images are decrypted within the operating system by the media player then sent to the GPU. This architecture's security depends on using a specific closed-source media player and no program being able to access the memory utilized when decrypting the data. Alternative models of using trusted GPUs have been considered [3], but have not been implemented. The Trusted Computing Group's architecture aims to address the issue of untrusted components and malicious software. It is an approach that is much wider
in scope than the issues addressed by GPU-based encryption. The proposed architecture utilizes distinct trusted platform modules (TPMs), which may be hardware or software, to address multiple needs and provide a generic solution [84]. For graphical applications, GPU-based encryption and decryption can be considered as an alternative that avoids specialized system components, or as a companion to TPMs. In particular, one possibility is for the TPM to handle key negotiation with the remote server, and then provide the session key to the GPU. An overview of the TPM is provided in Chapter 6. The main goal in moving decryption of graphics into the GPU is to prevent the underlying operating system or other software from gaining access to the unencrypted data. Specifically, the aim is to thwart malicious software running on the client's operating system that attempts to read or modify displays and responses transmitted between the server and the client. Modifications to the client's hardware are not proposed. Security of the client's surroundings (e.g., a camera recording the client's display) is a separate problem outside the scope of CryptoGraphics.
3.3.3 Other Related Work
Most recent work on addressing this problem has focused either on detection of spyware activity on a system or on building a trusted system from the bottom up, using a combination of hardware support, operating system extensions and application-specific logic. Rootkit and spyware detection measures include techniques such as:
• Static program (binary) analysis to identify use of possibly dangerous system calls or libraries [13]. Such techniques are countered by malware authors through the use of polymorphism and metamorphism.
• System emulation (typically through an instruction-level emulator) to determine the behavior of an untrusted piece of code. This technique has been used by antivirus vendors for many years to detect some of the more successful self-modifying viruses.
• System call (or behavior) monitoring to identify code that attempts to register as a driver or otherwise interact with the system in a suspicious way [12, 73].
• The use of a virtual machine (VM) to identify the presence of rootkits. This is achieved by accessing the guest operating system's low-level data structures, such as the list of active processes or operating system drivers, both through the guest operating system (where the results will be manipulated by the rootkit) and by directly accessing the same data structures from the host operating system. Any discrepancies when comparing the results would indicate the presence of a rootkit.
• Similar in concept to the previous item, the use of specialized hardware (e.g., in the form of a PCI extension board) to monitor the state of kernel data structures [34]. Upon detection, the system may be rolled back to a known clean state or, if possible, disinfected.
• Periodic rebooting of the system to bring it to a known clean state, and monitoring of driver loads.
Trusted computing platforms [81] provide new primitives that applications (and the operating system) can use to strengthen the security of the system. Although a fair number of such schemes have been proposed, and the details of many industry-led efforts keep changing as the requirements evolve, there are some pieces of functionality that seem to be gaining acceptance as useful features:
• The simplest feature involves the use of extra hardware as a storage facility for sensitive information such as keys and passwords [14, 49]. These keys and passwords are only released to authorized processes, which are identified by their in-memory signature or by the operating system.
• Another piece of functionality involves the validation of the various software components running in the system [72, 85, 91]. This is typically an iterative process, wherein each layer of the system verifies the integrity of the next layer before passing control to it (or allowing it to execute) [1]. For example, the operating system will verify the signature or hash of any program that is about to be loaded. "Unknown" programs may not be allowed to proceed or (more likely) may be disallowed from accessing the password store facility [11]. The lowest levels of the system software (BIOS, extension card ROMs, boot loader) are validated by the trusted hardware at boot time. If a discrepancy is detected, the system may be initialized from a known good image [2].
One problem with this approach is that such mechanisms only validate the integrity of the system at boot time (or, in any case, infrequently); thus, a system compromise may be undetected for a long time, during which sensitive information may be captured. However, static system integrity verification can be combined with dynamic checks and with VM-based sandboxing to minimize the scope of a successful attack.
• A third security feature is the ability to "lock" certain memory pages that contain sensitive information, preventing any other code, including the operating system and even the program itself, from accessing them except under specific conditions. Such conditions include accessing the pages through a particular set of functions whose code has been carefully scrutinized and is trusted to use the information in a safe manner. For example, passwords
may only be accessible (readable) from a piece of code that will encrypt the password and transmit it to a remote site under an SSL session.
• New hardware may also be used to implement a "trusted path" between an input device such as the keyboard (or an output device such as a monitor, to protect against output-side attacks) and an application [90]. When constructing such a trusted path, I/O is performed directly between the device and the application, bypassing device drivers and the rest of the operating system. A trusted path on the input side could be used by a browser to securely retrieve a password that a user enters on their keyboard, with no possibility of interception by a keystroke logger or other snooping malware. On the output side, a trusted path could be used to securely present sensitive information such as a bank statement. In addition to having well-defined primitives that applications can use to bypass the operating system when constructing a trusted path, the hardware must also be able to provide to the user a non-spoofable indication that the trusted path feature has been enabled. For example, there may be a conspicuous LED indicator on the keyboard that can only be activated by the security hardware. Lacking such a scheme, users can never be certain as to where their sensitive information will end up or where it came from. Ideally, it should also be possible to identify the application that is at the other end of the trusted path.
While promising, these approaches offer only limited security against an adversary that legitimately controls the spyware-infected system, or against spyware that does not exhibit real-time activity (e.g., consider a program that simply takes snapshots of the system's screen as the unsuspecting user is accessing some sensitive information). While images, like any data, can be sent encrypted over networks using existing protocols such as TLS [21] and IPsec [37], decryption is performed by the operating system, creating the potential for the data to be copied by an untrusted client.
In principle, isolation techniques, such as VM-based sandboxing of different processes, can help to minimize the actions of an attacker that compromises a single application. In practice, however, the fact that a small number of applications (e.g., web browsers, email clients) are used for a variety of tasks severely limits the effectiveness of such schemes: a successful compromise of the user's web browser will likely allow an attacker to observe and exfiltrate, over time, most of the interesting sensitive information. Furthermore, the degree of integration among different applications limits the practicality and effectiveness of sandboxing. In [30] an open-source cryptographic co-processor is described. It is focused on protecting keys and other sensitive information from tampering by unauthorized applications. The cryptlib library is extended to communicate with the co-processor. While [30] discusses several options for hardware acceleration
and identifies some potential performance bottlenecks, it is mostly a qualitative analysis. That work is extended in [29], which presents a comprehensive cryptographic security architecture, again focusing primarily on preserving the confidentiality of users' (and applications') cryptographic keys, with similar work discussed by McGregor and Lee [49]. In CryptoGraphics, a simpler problem is of interest: how to protect data sent to an untrusted client to be displayed to a user while using hardware commonly available in all systems, namely the GPU.
3.3.4 Summary of the GPU-Based Approach
A system's GPU is proposed as the only trusted component in a spyware-safe system for displays. By using GPUs, existing capabilities within a system are leveraged as opposed to designing and adding a new component to protect information sent to remote displays. Specifically, sensitive content is directly passed to the GPU in encrypted form. The GPU decrypts and displays such content without ever storing the plaintext in the system's main memory or exposing it to the operating system, the CPU, or any other peripherals. A remote keying protocol is used to securely convey the decryption key(s) to the GPU, without exposing them to the underlying system. With this mechanism as the basic building block, applications such as secure video broadcasts or remote desktop display access can be implemented without trusting the rest of the system. Furthermore, the design allows a user to securely enter a password or PIN to a remote system without revealing it to any spyware and without requiring additional hardware. Finally, by using a suitably modified USB keyboard, it is possible to completely protect the user's communications with a remote server.
3.4 Side Channel and Differential Fault Analysis
A related topic concerns types of attacks that involve monitoring and/or physically damaging the device performing the cryptographic operations. These types of attacks rely on at least partial access to the component performing the encryption. The first type is passive side channel analysis, in which the resources of the device performing the cryptographic operation, or the information the device emits, are monitored. The second type is differential fault analysis. This involves introducing faults into the device and comparing the results of the operation performed with and without the fault. The idea for these two categories of attacks first appeared in the late 1990s. Side channel attacks involve observing side channels of a device. The types of information monitored include power consumption, timing characteristics, electromagnetic emanation and acoustic emanation. For example, the time and/or power involved in performing an exponentiation may vary depending on the exponent in a public key algorithm [40]. There are companies, for
example Riscure, that sell software for performing power analysis on smart cards. The use of CPU acoustics in attacking RSA was also found to have potential [76]. In the acoustic attacks, the sound emitted from the CPU can be separated from that of the fan because the fan noise is typically below 10 kHz, whereas the CPU noise is above 10 kHz. Early work on applying side channel attacks to symmetric key ciphers includes [36], which discusses side channel attacks on block ciphers based on their structure as a number of rounds. Examples of how certain types of side channel information can be used to attack DES [55], IDEA [43] and RC5 [67] are covered. Block ciphers used in practice are structured as a series of rounds. The side channel attacks attempt to recover internal state information, such as the output from or input to a round, in order to determine enough round key bits per round to allow an exhaustive search on the remaining bits. Memory usage has also been shown to be a valuable source of information. The work in [60] demonstrated how the cache, which is shared memory, contributes to information leakage. By having a second process access the same memory used by the cipher, enough information is obtained to determine the expanded key for AES. Differential fault analysis involves inducing faults into the device performing the cryptographic operation, for example by using radiation to damage the device, and then comparing the outputs produced before and after the introduction of the fault. The concept of fault analysis was first proposed in [6] for public key ciphers. The method was discussed in terms of symmetric key ciphers in [5], which described how the round key bits of DES, the standard block cipher at the time, could potentially be recovered through the introduction of faults. However, differential fault analysis attacks assume that an attacker is able to introduce faults into a sealed tamperproof device.
This can be achieved through several means, such as radiation, with a probability that a fault is created in an (unknown) single bit location in one of the registers at some random intermediate stage in the encryption or decryption. As a result, this concept is less practical than side channel analysis. The concepts of side channel analysis and differential fault analysis may apply to GPUs; however, the specific methods and tools for applying these types of attacks will not work on GPUs. As a result, GPU-based encryption prevents existing instances of such attacks. New analysis applicable to GPUs may prove that some of the attacks are feasible for GPU-based encryption. For example, monitoring the power consumption and heat dissipation of GPUs may prove to be of assistance in trying to determine the magnitude of any parameter for which the amount of work performed varies in accordance with the parameter's value. It has been shown that GPUs' power consumption varies with the images generated in different games; therefore, the same may be true for other algorithms run in the GPU. Another potential source of information is the processing time of commands in the GPU. This may be measured by timing the
delay in commands issued from the CPU to the GPU, if the CPU is issuing OpenGL or Direct3D commands. Methods requiring another process to run simultaneously with the encryption algorithm in order to perform measurements or obtain information, such as the attack described in [60], are unlikely to work with the GPU. This is because the process would have to either run within the GPU or issue commands to the GPU while the cryptographic operations are being performed. This is not possible when other processes are blocked from accessing the GPU.
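The remark earlier in this section that the time or power used by an exponentiation may depend on the exponent can be made concrete with a small sketch. The Python code below is an illustration only, not an attack tool and not taken from the cited work: it implements a textbook left-to-right square-and-multiply exponentiation and counts the modular multiplications performed. The function name `modexp_with_count` and the small modulus are assumptions chosen for the example.

```python
def modexp_with_count(base: int, exponent: int, modulus: int):
    """Left-to-right square-and-multiply.

    Returns (result, number_of_modular_multiplications).
    """
    result = 1
    multiplications = 0
    for bit in bin(exponent)[2:]:
        # A squaring is performed for every bit of the exponent.
        result = (result * result) % modulus
        multiplications += 1
        if bit == "1":
            # An extra multiplication is performed only for 1 bits,
            # so the total work depends on the exponent's bit pattern.
            result = (result * base) % modulus
            multiplications += 1
    return result, multiplications


modulus = 1000003
sparse_exponent = 0b10000001  # 8 bits, two of them set
dense_exponent = 0b11111111   # 8 bits, all of them set

r_sparse, ops_sparse = modexp_with_count(5, sparse_exponent, modulus)
r_dense, ops_dense = modexp_with_count(5, dense_exponent, modulus)
# ops_sparse is 10 and ops_dense is 16: same exponent length, different work.
```

Two exponents of the same length therefore cost a different number of operations, and an observer who can measure time or power with enough precision learns something about how many exponent bits are set. Constant-time implementations exist precisely to remove this dependency.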
Chapter 4 ENCRYPTION IN GPUS
4.1 Overview
This chapter presents the technical details involved in using GPUs to perform encryption and decryption. The feasibility of running either a symmetric or asymmetric key cipher in a GPU is evaluated by considering the types of operations required by existing ciphers and the operations supported in GPUs. Then the use of symmetric key ciphers in a GPU is further evaluated using two examples. The first example consists of an implementation of AES in OpenGL to demonstrate that encryption within a GPU is possible using an existing symmetric key cipher. A detailed description of how AES is converted to a representation that allows it to be run within a GPU is provided. The derivation of the OpenGL version of AES and its implementation illustrate the difficulties that arise when utilizing GPUs for algorithms performing byte-level operations. The second example concerns the use of GPUs to apply the key stream from a stream cipher to a large segment of data in one step. While the fact that GPUs can be used to XOR large quantities of data is trivial, this is an initial step toward applying a stream cipher within a GPU and exemplifies the benefit of using a GPU as a general purpose parallel processor. Any implementation of a cipher that uses a GPU will involve executing a program on the operating system that issues commands to the GPU. The commands are independent of both the data being encrypted or decrypted and the key bits. It is acceptable that other processes running on the operating system may be aware of what commands are being issued to the GPU because the algorithm is not a secret. When the intent is to prevent spyware on the system from accessing the plaintext and/or key, the implementation must not involve intermediate results being written from the GPU to the system, in order to prevent attacks that try to recover plaintext and/or key bits using intermediate results. If
the only purpose in using a GPU is to offload the CPU and/or system memory when performing encryption and decryption, then a combination of the CPU, system memory and GPU can be used. The motivation for the experimentation with GPU-based encryption and decryption described here includes protecting data from spyware on untrusted clients; therefore, the focus is on implementing a cipher with all commands executed in the GPU and all intermediate results confined to the GPU. Modes of encryption for block ciphers are also discussed. When dealing with block ciphers, in addition to implementing the cipher within a GPU, the mode in which the block cipher will be used must be considered. A GPU's ability to encrypt thousands or hundreds of thousands of data blocks in parallel is only beneficial if either no mode of encryption requiring chaining of data blocks, such as cipher block chaining (CBC), is used or if each of the data blocks being encrypted simultaneously corresponds to a separate stream of data to which chaining is applied. Asymmetric key ciphers are evaluated first in Section 4.2. Presently there is no way of implementing an existing asymmetric key cipher in a GPU without modifications to the GPU. As discussed in Chapter 5, this presents challenges when trying to securely convey a secret key to the GPU for use in a symmetric key cipher because existing remote keying protocols commonly rely on public key ciphers. The remainder of this chapter then focuses on symmetric key ciphers. In Section 4.3, common operations used in existing symmetric key ciphers and how they can or why they cannot be programmed in a GPU are identified. Section 4.4 reviews the common modes of encryption for block ciphers to show that they can be performed within a GPU and discusses how parallel processing of data can be performed within a GPU for each mode. Section 4.5 describes an implementation of AES in OpenGL.
How AES can be converted into a representation that can be programmed in OpenGL is explained. The OpenGL version of AES is described and results from experiments comparing the encryption rates of the OpenGL version to C versions of AES are provided.
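The constraint that chaining modes place on parallel encryption, noted above, can be seen in a short sketch. The Python code below is an assumption-laden illustration, not the book's implementation: `toy_block_fn` is a hypothetical stand-in for the forward direction of a block cipher (a truncated SHA-256 of the key and block, which is not invertible and not a real cipher), and counter (CTR) mode is used as a familiar example of a mode without chaining. All function names are illustrative.

```python
import hashlib

BLOCK = 16  # block size in bytes (128 bits)


def toy_block_fn(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher's encryption direction.

    NOT a real cipher and not invertible; it only provides a keyed
    mapping from one 16-byte block to another for this illustration.
    """
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key: bytes, iv: bytes, blocks):
    """CBC: block i needs ciphertext block i-1, so the loop is sequential."""
    out, previous = [], iv
    for plaintext_block in blocks:
        previous = toy_block_fn(key, xor_bytes(plaintext_block, previous))
        out.append(previous)
    return out


def ctr_encrypt(key: bytes, nonce: bytes, blocks, order=None):
    """CTR: each block depends only on its own index, so any order works."""
    indices = range(len(blocks)) if order is None else order
    out = [None] * len(blocks)
    for i in indices:
        counter_block = nonce + i.to_bytes(8, "big")
        out[i] = xor_bytes(blocks[i], toy_block_fn(key, counter_block))
    return out


key = b"k" * 16
iv = bytes(BLOCK)
nonce = b"n" * 8
blocks = [bytes([i]) * BLOCK for i in range(4)]

cbc_ciphertext = cbc_encrypt(key, iv, blocks)
ctr_ciphertext = ctr_encrypt(key, nonce, blocks)
ctr_reordered = ctr_encrypt(key, nonce, blocks, order=[3, 2, 1, 0])
```

In the CTR-style loop the blocks can be processed in any order, or all at once, and give the same result; in the CBC loop each iteration must wait for the previous ciphertext block. This is why chaining limits the benefit of a data-parallel processor unless the parallel blocks belong to separate streams.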
4.2 Feasibility of Asymmetric Key Ciphers
Existing asymmetric key ciphers are based on the hardness of factoring and of computing discrete logs of large integers. Asymmetric key ciphers, also referred to as public key ciphers, are typically not used to encrypt large quantities of data but instead to encrypt small values or to establish a shared value that can then be used as a shared secret, such as the key to a symmetric key cipher. This is due to the types of computations required and the rate at which they can be performed. Public key algorithms (RSA [70], Diffie-Hellman (DH) and Elliptic Curve Cryptography (ECC) algorithms) are not suitable for implementation in a GPU due to their use of large integers.
The following is a brief overview of RSA, DH and ECC to remind the reader of the operations utilized in public key cryptography.
RSA: Given two primes, $p$ and $q$ with $p \neq q$, let $n = pq$ and $\phi(n) = (p - 1)(q - 1)$. Select an integer, $e$, where $1 < e < \phi(n)$, that is relatively prime to $\phi(n)$. Compute $d$, the multiplicative inverse of $e \bmod \phi(n)$. The public key is $(e, n)$ and the private key is $(d, n)$. Data is encrypted by representing it as an integer, $m < n$. The ciphertext is $c = m^e \bmod n$. $c$ is decrypted by computing $m = c^d \bmod n$.
DH: A large prime, $p$, and an integer $g < p$ are publicly known. Two entities, A and B, can establish a shared secret value as follows: A selects a random integer $a$ and B selects a random integer $b$. They do not share $a$ and $b$ with each other. A computes $v_a = g^a \bmod p$ and B computes $v_b = g^b \bmod p$. A and B exchange $v_a$ and $v_b$ with each other. Then A computes $s = (v_b)^a \bmod p$ and B computes $s = (v_a)^b \bmod p$. In both cases $s = g^{ab} \bmod p$. $s$ is the secret value A and B share.
ECC: An elliptic curve is defined and a large prime, $q$, is selected. The set of points, $\{(x, y)\}$, on the curve where both $x$ and $y$ are integers less than $q$ is of interest. Point multiplication can be used to establish keys. If an integer $k$ is the private key and $G$ is a point in $\{(x, y)\}$, then $P = kG$ is the public key. Given $P$ and $G$, $k$ cannot be found in reasonable time. Let A have public and private keys $P_A$ and $k_A$, respectively, where $P_A = k_A G$. Let B have public and private keys $P_B$ and $k_B$, respectively, where $P_B = k_B G$. A and B can establish a shared secret value by A computing $k_A P_B$ and by B computing $k_B P_A$; both values equal $k_A k_B G$.
When used to establish a 128-bit key for AES, the parameter sizes recommended by NIST are a 3072-bit integer for the RSA and DH keys (the exponents) and a 256-bit key for ECC algorithms. The magnitude of the parameters makes it infeasible to program any of the three algorithms to run entirely in existing GPUs.
While the exponentiation used in RSA and DH can be implemented as repeated multiplications or additions, GPUs do not support variables of the size required and it is not possible to use multiple pixels to store and operate on a single integer. Even if large integers were supported in GPUs, the lack of modular arithmetic in GPU APIs also poses a problem. In some cases this can be accommodated using table lookups when only one parameter is a variable. But when the two values being added or multiplied and the modulus are variable, there is no obvious way of performing the operation entirely in the GPU without exposing at least one of the operands outside of the GPU (the modulus is not secret and can be exposed outside of the GPU). Furthermore, using a table is infeasible because of the number of entries it would have to contain. As a result, asymmetric key ciphers are not suitable for implementation within existing GPUs.
4.3 Feasibility of Symmetric Key Ciphers
Within this section the operations typically found in symmetric key ciphers (block and stream ciphers) are considered in terms of how they can be implemented within a GPU. The use of floating point arithmetic and the fact that the GPU APIs are not designed for the byte-level operations common in symmetric key ciphers create obstacles when trying to implement a block or stream cipher within a GPU. In general, most operations found in block ciphers can be implemented via a series of logical operations and table lookups in a GPU, but there are some operations that cannot be represented in a GPU. Stream ciphers tend to have a larger quantity of operations that are either not implementable or not easily implemented in a GPU.

A block cipher is an algorithm that performs a permutation of all b-bit strings for a fixed value of b. The exact mapping of b-bit inputs to b-bit outputs is determined by the key. b is the block size. When encrypting, a block cipher takes a b-bit plaintext and a secret key as inputs, and produces a b-bit ciphertext. When decrypting, the ciphertext and secret key are the inputs and the plaintext is the output. In practice, block ciphers typically support block sizes of 128 or 256 bits and use keys of 128, 192 or 256 bits. DES, the standard prior to AES, is still in use and operates on 64-bit blocks.

Block ciphers used in practice are structured as a series of rounds. In each round, a function, referred to as the round function, is applied to the data. The output of the i-th round forms the input to the (i + 1)-th round, with the plaintext being the input to the first round when encrypting. A secret key is input to the block cipher along with the data being encrypted or decrypted. The key is expanded to a series of round keys for use in the individual rounds via a function referred to as the key schedule.
A stream cipher is an algorithm that takes a secret key as input and applies a function to it that continuously outputs pseudorandom bits. The output is called the key stream. In practice a stream cipher needs to pass certain statistical tests to be viewed as outputting pseudorandom bits. When encrypting, the bits are XORed with the plaintext to produce the ciphertext. When decrypting, the bits are XORed with the ciphertext to produce the plaintext. At some point the function will return to its initial state and the key stream will repeat (and thus is distinguishable from random bits at this point). The number of bits output before the key stream repeats is called the period. Within one period of the key stream, when given a segment of the key stream, it should not be possible to determine other key stream bits and/or any key bits with non-negligible probability.

The following operations are common in existing block ciphers:
• substitution boxes (S-Boxes): An S-Box is used to perform a permutation on part of the data block. A table lookup is performed on the data bits and the data bits are replaced with the table entry.
• bitwise rotations
• permutations of bits, bytes or words
• logical operations: XOR is the most common logical operation
• modular arithmetic: addition, subtraction, multiplication
• indexing into arrays

In addition to the above operations, the following operations are found in various stream ciphers:
• bitwise shifts
• conditional statements
• logarithmic functions (which can be implemented as a table lookup)

While simple logical operations can be performed efficiently in GPUs on large quantities of bytes, as shown in Section 4.6.2, the byte and bit-level operations typically found in symmetric key ciphers, such as shifts and rotations, are not available via the APIs to GPUs. Modular arithmetic operations are also not available. Some operations can be performed using a combination of OpenGL commands. For example, rotations and shifts on a single byte (or the number of bits in one color component of a pixel) can be performed by defining masks of pixels and using multiple copy commands. Other operations, such as shifts across multiple bytes and table lookups where the value of specific bits determines what table is used, prove to be more difficult.

When determining whether or not encryption using a block cipher can be performed in a GPU, several block ciphers were evaluated. While AES is not the only block cipher that can be implemented to run in a GPU, AES was chosen for implementation because it is the current NIST standard. There are also block ciphers that cannot be implemented within current GPUs. In order to provide examples of operations that cannot be performed in a GPU using existing APIs, a few specific steps from such block ciphers are considered.

There is no straightforward way to implement in OpenGL the data dependent rotations found in the block ciphers RC6 [66] and MARS [18]. A rotation by a constant amount can be performed by defining a colormap that maps one color component of a pixel to the result of the rotation.
The rotation is performed by turning on the colormap and copying the pixel whose components are being rotated to another location. However, if the rotation amount varies based on the value of a pixel component, as in the case of the data dependent rotations, there is no method for deciding within the GPU which colormap to turn on and applying it to a pixel based on the value of another pixel. Instead, the pixel value that determines the rotation amount has to be read into system memory and its
value used to issue a command to the GPU to indicate which colormap to turn on, potentially giving malware access to intermediate results in the encryption process that can be used to attack the cipher. The use of a table is not possible if the rotation involves bits in multiple pixels and/or color components.

RC6 is defined to operate on blocks of size 4w where w is the word size, and where the key size and number of rounds are parameters. The following is a description of RC6's encryption function when operating on 128-bit blocks (w = 32) and 128-bit keys with 20 rounds. Refer to [66] for a description of RC6's key schedule and decryption function. The 128-bit block of data is broken into four 32-bit words. These will be referred to as X1, X2, X3 and X4. r is the number of rounds and is set to 20. The array, S, contains the expanded key. Each entry is a 32-bit word. x <<< y means to rotate x to the left by y positions. The fixed rotation amount used when computing t and u is log2(w), which is 5 for 128-bit blocks. All arithmetic is performed modulo 2^w (2^32 for 128-bit blocks).

RC6's Encryption Function:
X2 = X2 + S[0]
X4 = X4 + S[1]
for (i=1; i <= 20; ++i) {
    t = (X2 * (2*X2 + 1)) <<< 5
    u = (X4 * (2*X4 + 1)) <<< 5
    X1 = ((X1 XOR t) <<< u) + S[2i]
    X3 = ((X3 XOR u) <<< t) + S[2i+1]
    (X1,X2,X3,X4) = (X2,X3,X4,X1)
}
X1 = X1 + S[2r+2]
X3 = X3 + S[2r+3]

RC6's decryption algorithm involves subtracting key words instead of adding them and rotating to the right when forming X1 and X3. The equations for t and u do not change.

When restricted to 32-bit pixels, it was not possible to implement RC6 for 128-bit blocks in a GPU because of the lack of modular arithmetic and the inability to rotate the bits in the pixel as a single bit string. Even when using 128-bit pixels, the lack of modular arithmetic prevents RC6 from being programmed to run in a GPU. In theory, the computations for t and u, and the rotations by 5 bits can be performed by table lookups, but this would require tables containing 2^32 entries. The addition of key material poses a greater problem because the lack of modular arithmetic cannot be overcome by using a table, since neither operand is constant. Likewise, the rotations performed when calculating X1 and X3 are data dependent, and thus cannot be performed by a table lookup unless one operand can trigger what table to use.
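To make the data dependent rotations concrete, the following Python sketch implements the encryption function described above together with the decryption function as characterized in the text (key words subtracted, rotations to the right). It is an illustration under stated assumptions: the array S_DEMO is a hypothetical stand-in for the expanded key and is not produced by RC6's real key schedule, so the outputs are not comparable with published RC6 test vectors.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32 for w = 32


def rotl(x, y):
    """Rotate the 32-bit word x left; only the low 5 bits of y are used."""
    y &= 31
    return ((x << y) | (x >> (32 - y))) & MASK


def rotr(x, y):
    """Rotate the 32-bit word x right; only the low 5 bits of y are used."""
    y &= 31
    return ((x >> y) | (x << (32 - y))) & MASK


def rc6_encrypt(block, S, r=20):
    x1, x2, x3, x4 = block
    x2 = (x2 + S[0]) & MASK
    x4 = (x4 + S[1]) & MASK
    for i in range(1, r + 1):
        t = rotl((x2 * (2 * x2 + 1)) & MASK, 5)
        u = rotl((x4 * (2 * x4 + 1)) & MASK, 5)
        x1 = (rotl(x1 ^ t, u) + S[2 * i]) & MASK      # rotation amount depends on data
        x3 = (rotl(x3 ^ u, t) + S[2 * i + 1]) & MASK  # rotation amount depends on data
        x1, x2, x3, x4 = x2, x3, x4, x1
    x1 = (x1 + S[2 * r + 2]) & MASK
    x3 = (x3 + S[2 * r + 3]) & MASK
    return x1, x2, x3, x4


def rc6_decrypt(block, S, r=20):
    x1, x2, x3, x4 = block
    x3 = (x3 - S[2 * r + 3]) & MASK
    x1 = (x1 - S[2 * r + 2]) & MASK
    for i in range(r, 0, -1):
        x1, x2, x3, x4 = x4, x1, x2, x3               # undo the word permutation
        u = rotl((x4 * (2 * x4 + 1)) & MASK, 5)
        t = rotl((x2 * (2 * x2 + 1)) & MASK, 5)
        x3 = rotr((x3 - S[2 * i + 1]) & MASK, t) ^ u
        x1 = rotr((x1 - S[2 * i]) & MASK, u) ^ t
    x4 = (x4 - S[1]) & MASK
    x2 = (x2 - S[0]) & MASK
    return x1, x2, x3, x4


# Hypothetical round keys (NOT the RC6 key schedule): 2r + 4 = 44 words.
S_DEMO = [(i * 0x9E3779B9 + 0xB7E15163) & MASK for i in range(44)]
plain = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
cipher = rc6_encrypt(plain, S_DEMO)
print([hex(word) for word in cipher])
```

Each of the masked additions and each call to rotl with a variable second argument corresponds to an operation that, per the discussion above, has no direct counterpart in the GPU API.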
Also consider the DES S-Boxes [55]. The index into the S-Box is based on 6 key bits XORed with 6 data bits. The S-Box can be represented by a colormap and the entry into the S-Box can be computed using the copy command and the logical XOR operation. Masks of pixels copied onto the data can be used to "extract" the desired 6 bits, but to merely XOR the 6 key bits with 6 data bits requires copying the pixel containing the desired key bits onto the pixel containing the mask with XOR turned on, doing the same for the data pixel, then copying the two resulting pixels to the same position. Once the XOR of the key and data bits is computed, the pixel containing the result of the XOR has to be copied with a colormap corresponding to the S-Box turned on in order to complete the S-Box lookup. Overall, to use OpenGL for the S-Box step in DES, a larger number of less efficient operations are required than in a C implementation.

Existing stream ciphers, such as LILI [78], RC4, SEAL [61], SNOW [23] and SOBER [68], also appear unsuitable for implementation in a GPU. The use of irregularly clocked feedback shift registers in LILI and SOBER makes these ciphers either impossible to implement in OpenGL or results in several commands being required to emulate a simple byte-level operation. SEAL, which operates on 32-bit words, also poses a problem by using 9-bit rotations that cannot be performed across pixels or color components of a pixel. Furthermore, prior to SEAL generating the key stream, tables whose entries are dependent on the secret key must be populated. This step involves several key dependent rotations that cannot be implemented in existing GPUs. RC4, discussed in more detail in Chapter 5, creates problems due to the need to use computational results (the results of modular addition) as indices into an array.
When using pixels to represent the data, modular addition between one variable and a static value can be performed in GPUs using a colormap, but the modular addition of two unknown values requires the ability to select the colormap based on a given pixel value, which is not supported. The following is pseudo code for RC4. The secret key is contained in the 256 byte array K. If the key is less than 256 bytes, it is repeated until all 256 bytes of K are filled. S is a 256 byte array initialized with values 0 to 255. The parameter len represents the number of key stream bytes needed.

RC4:
/* Initialize S */
for (i = 0; i < 256; ++i) {
    S[i] = i
}
/* Incorporate K into S */
j = 0
for (i = 0; i < 256; ++i) {
    j = (j + S[i] + K[i]) mod 256
    swap(S[i], S[j])
}
/* Generate the key stream */
i = 0
j = 0
cnt = 0
while (cnt < len) {
    i = (i + 1) mod 256
    j = (j + S[i]) mod 256
    swap(S[i], S[j])
    /* output a byte to the key stream */
    output(S[(S[i] + S[j]) mod 256])
    cnt = cnt + 1
}
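The RC4 pseudo code can be transcribed almost line for line into Python, as sketched below. The function names rc4_keystream and rc4_xor are assumptions made for this example. Note how every index (j, and the final output index) is the result of a modular addition of two values that are unknown in advance, which is the property identified above as the obstacle to a GPU implementation.

```python
def rc4_keystream(key, length):
    """Return `length` key stream bytes for the given key (bytes)."""
    # K is the key repeated until all 256 bytes are filled.
    K = [key[i % len(key)] for i in range(256)]
    S = list(range(256))

    # Incorporate K into S.
    j = 0
    for i in range(256):
        j = (j + S[i] + K[i]) % 256
        S[i], S[j] = S[j], S[i]

    # Generate the key stream.
    i = j = 0
    out = bytearray()
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)


def rc4_xor(key, data):
    """Encrypt or decrypt by XORing data with the key stream."""
    stream = rc4_keystream(key, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))


ciphertext = rc4_xor(b"Key", b"Plaintext")
print(ciphertext.hex())
```

Because encryption and decryption are the same XOR, calling rc4_xor a second time with the same key recovers the plaintext.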
For stream ciphers using LFSRs, it is possible to implement the general structure of a feedback shift register in a GPU by either altering the coordinates of the pixels used as input to the register in each clock cycle or by copying the pixels containing the subset of the output that forms part of the next input to the proper coordinates used in the next input. Whether or not a specific feedback shift register can be implemented in a GPU depends on the function that creates the output. Irregular clocking of a feedback shift register is performed by using output and/or current state to determine whether to produce output in the next cycle. The irregular clocking requires a conditional statement, which presently must be executed in the CPU. However, writing any data back to the system's memory for use in the test will expose bits from the key stream and/or intermediate state data to the system.

Another issue is how the parallel processing capability of GPUs can be leveraged for stream ciphers. A block cipher with a specific key can be used to encrypt multiple plaintexts. Simultaneous encryption or decryption of multiple texts or blocks with the same key is useful. In contrast, a stream cipher would not be executed multiple times in parallel using the same key and XORed with data. This is because the same key stream should not be used to encrypt different sets of data. Consider what happens if two plaintexts, P1 and P2, are XORed with the same key stream, KS, to produce ciphertexts C1 and C2: C1 ⊕ C2 = P1 ⊕ KS ⊕ P2 ⊕ KS = P1 ⊕ P2. An adversary with access to the ciphertexts has the XOR of the plaintexts. This provides the adversary certain information about the plaintexts, including the positions of every bit where P1
matches P2. Therefore, if n streams of data are processed simultaneously, n key streams corresponding to n different keys must be generated in parallel. If the stream cipher can be implemented to run in a GPU in a manner that allows n different instances of it (one for each key) to run simultaneously, then the GPU would be useful for offloading work from the CPU. This is not feasible with a stream cipher using an irregularly clocked feedback shift register because of the conditional statement for the clocking. Even if branching is added to the pixel processor, each instance of the stream cipher would require a separate path of branches, all of which must be executed in parallel. The parallel computation of GPUs does allow a large segment of data to be XORed with a key stream at once. The data (or key stream) is loaded into the framebuffer, the logical operation of XOR is enabled, then the key stream (or data) is written to the framebuffer.
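The consequence of reusing a key stream can be checked directly. In the short Python sketch below, the two messages and the byte sequence standing in for the key stream are made-up values for illustration; any key stream gives the same result, because it cancels out of the XOR of the two ciphertexts.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


p1 = b"ATTACK AT DAWN"
p2 = b"ATTACK AT DUSK"
ks = bytes(range(7, 7 + len(p1)))  # stand-in key stream, reused for both messages

c1 = xor_bytes(p1, ks)
c2 = xor_bytes(p2, ks)

# The adversary sees only c1 and c2, yet obtains P1 XOR P2.
leak = xor_bytes(c1, c2)
matching_positions = [i for i, v in enumerate(leak) if v == 0]
print(matching_positions)
```

A zero byte in leak marks a position where the two plaintexts agree, so the adversary learns here that the messages share their first eleven bytes without knowing the key.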
4.4 Modes of Encryption
When encrypting data with a block cipher, the data is broken into multiple blocks with each block containing the number of bits supported by the block cipher. The series of blocks is then encrypted using a mode of encryption. One of the attractive features of GPUs is their ability to process large quantities of data in parallel. The mode of encryption used is of interest not because of the operations required (which are trivial to program in a GPU), but because the mode determines whether or not blocks from the same data source can be processed in parallel within a GPU.
Figure 4.1. ECB Encryption Mode
A few common modes of encryption are the Electronic Code Book (ECB), Cipher Block Chaining (CBC), Output Feedback (OFB), Cipher Feedback (CFB), and Counter (CTR) modes. These modes are shown in Figures 4.1, 4.2, 4.3, 4.4 and 4.5. In all of the diagrams, P_i refers to the i-th block of the plaintext, C_i refers to the i-th block of the ciphertext and E_k is the encryption algorithm using key k. In the CBC, OFB, CFB and CTR modes, IV is an initialization vector. All of the operations required of these modes (beyond the operations required of the block cipher) can be performed in GPUs.

Figure 4.2. CBC Encryption Mode

Figure 4.3. CTR Encryption Mode (the input to the n-th application of the block cipher is IV + n - 1)

Figure 4.4. OFB Encryption Mode (I_1 = IV; X_i is the leftmost x bits of the b-bit output from the cipher and bits x+1 to b of the output are discarded; P_i is x bits; I_i is bits x+1 to b of I_(i-1) concatenated with X_(i-1))

ECB encrypts each block separately and requires no operations aside from the application of the block cipher. All of the other modes require XORs. CBC XORs the (i - 1)-th ciphertext block with the i-th plaintext block before encrypting the i-th block. CTR, OFB and CFB apply the block cipher to values generated from an IV instead of to the plaintext and are used to create a key stream that is XORed with the plaintext. They differ in how the IV is updated after each block to create the input to the next application of the block cipher. OFB and CFB may require bit extraction, depending on the parameters, to only use a subset of the bits output from each application of the block cipher. The bit extraction can be performed by XORing the pixels containing the bits to be extracted with pixels containing static masks.

Each of the five modes shown allows blocks from different data sources or streams to be processed in parallel. Only ECB and CTR modes allow data from the same source or stream to be processed in parallel. Modes that allow for parallel processing of data are ideal when encrypting in a GPU because a large quantity of blocks can be represented by rows or columns of pixels. While ECB mode allows encryption to be performed on blocks from a single data source in parallel, ECB is not recommended for use in practice because identical plaintext blocks produce identical ciphertext blocks, allowing patterns to be recognized. This is especially apparent with displays where areas of the display are the same color, and pictures with a uniform background, in which
case using ECB produces outlines of shapes. CBC eliminates the problem with patterns, but does not allow blocks from the same data source or stream to be encrypted in parallel because the i-th block depends on the ciphertext from the (i - 1)-th block. When using CBC mode, the parallel computation of the GPU can be leveraged if multiple data sources or streams are available to be encrypted simultaneously. OFB and CFB each use output from the (i - 1)-th application of the block cipher to form the input to the i-th application. This prevents parallel computation of the key stream and parallel processing of data blocks from the same source in both of these modes. The key stream "blocks" produced by CTR mode can be computed in parallel because the input to the i-th application of the block cipher is IV + i - 1. Pixels can be loaded with these values then encrypted simultaneously and the result XORed with the data simultaneously.

Figure 4.5. CFB Encryption Mode (the cipher outputs b bits and the rightmost b - x bits are discarded; P_j is x bits; I_j is bits x+1 to b of I_(j-1) concatenated with the previous x-bit ciphertext block)
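The independence of the CTR key stream blocks can be sketched in a few lines of Python. Two assumptions should be kept in mind: toy_block_function is a stand-in built from SHA-256 purely so that the example runs with the standard library (it is not a block cipher and not AES), and a thread pool is used here only as an analogy for the GPU processing many pixels at once.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block (128 bits)


def toy_block_function(key, counter):
    """Stand-in for E_k: a keyed pseudorandom function, NOT a real block cipher."""
    data = key + counter.to_bytes(BLOCK, "big")
    return hashlib.sha256(data).digest()[:BLOCK]


def ctr_keystream_block(key, iv, i):
    """Key stream block i (1-based) is E_k(IV + i - 1)."""
    return toy_block_function(key, (iv + i - 1) % (1 << (8 * BLOCK)))


def ctr_xor(key, iv, data, workers=4):
    """Encrypt or decrypt data; every key stream block is computed independently."""
    n_blocks = (len(data) + BLOCK - 1) // BLOCK
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = list(pool.map(lambda i: ctr_keystream_block(key, iv, i),
                               range(1, n_blocks + 1)))
    stream = b"".join(blocks)
    return bytes(d ^ s for d, s in zip(data, stream))


key = b"k" * 16
iv = 1000
message = bytes(range(64))
ciphertext = ctr_xor(key, iv, message)
print(ciphertext.hex())
```

Block 3 of the key stream can be computed without blocks 1 and 2. In CBC, OFB or CFB the corresponding computation would have to wait for the output of the previous application of the cipher.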
4.5 Example: AES
4.5.1 AES Background
AES [56] is the 128-bit block version of Rijndael [19] and is the current NIST standard block cipher. AES can be used with 128, 192 and 256-bit keys. The OpenGL implementation of AES described in this chapter supports 128-bit keys and all descriptions of AES included here assume 128-bit keys
are used. In order to understand the OpenGL implementation of AES, it is necessary to know how AES is represented in software implementations. The steps described in FIPS 197 [56] can be rearranged and combined to produce a representation in which the round function consists entirely of table lookups and XORs. This latter version results in significantly faster encryption and decryption rates compared to an implementation following the steps in FIPS 197, but requires additional memory to store the tables. Neither of the two versions can be implemented directly in OpenGL. Instead, a modified form of the table lookup version is used.

For 128-bit blocks and 128-bit keys, the AES round function for encryption is typically described with data represented as a 4x4 matrix with each entry containing one byte:

\[
A = \begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\]
The round function is applied ten times. The data, A, is XORed with key bits prior to the first round. A round consists of the following steps:

(I)
SubBytes (S-Box applied to each entry)
ShiftRows (bytes within each row of A are shifted 0 to 3 columns)
MixColumns (a matrix multiplication; absent in last round)
AddRoundKey (A is XORed with a round key)

The round function for decryption uses the inverse function for each of SubBytes, ShiftRows and MixColumns. Encryption consists of the following steps.

AES's Encryption Function:
AddRoundKey
for (i=1; i < 10; ++i) {
    SubBytes
    ShiftRows
    MixColumns
    AddRoundKey
}
SubBytes
ShiftRows
AddRoundKey
In the SubBytes step, each byte is used as an index into a table and the byte is replaced with the table entry. In block ciphers, tables used for such substitutions are referred to as S-Boxes. Tables 4.1 and 4.2 contain the S-Boxes for AES encryption and decryption, respectively. The table lookup is performed by viewing the byte as two 4-bit values, with the leftmost 4 bits used as the row index and the rightmost 4 bits used as the column index. For example, 0x29 uses the row corresponding to 2 and column corresponding to 9. 0x29 is replaced with 0xa5 when encrypting and 0x4c when decrypting.

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0   63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1   ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2   b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3   04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4   09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5   53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6   d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7   51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8   cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9   60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
a   e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
b   e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
c   ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
d   70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
e   e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
f   8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
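Table 4.1. AES S-Box for Encryption

In the ShiftRows step, the entries in the i-th row of the matrix A are rotated i positions to the left, for i = 0 to 3, when encrypting. Specifically, the matrix

\[
A = \begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{pmatrix}
\quad \text{becomes} \quad
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{11} & a_{12} & a_{13} & a_{10} \\
a_{22} & a_{23} & a_{20} & a_{21} \\
a_{33} & a_{30} & a_{31} & a_{32}
\end{pmatrix}
\]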
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0   52 09 6a d5 30 36 a5 38 bf 40 a3 9e 81 f3 d7 fb
1   7c e3 39 82 9b 2f ff 87 34 8e 43 44 c4 de e9 cb
2   54 7b 94 32 a6 c2 23 3d ee 4c 95 0b 42 fa c3 4e
3   08 2e a1 66 28 d9 24 b2 76 5b a2 49 6d 8b d1 25
4   72 f8 f6 64 86 68 98 16 d4 a4 5c cc 5d 65 b6 92
5   6c 70 48 50 fd ed b9 da 5e 15 46 57 a7 8d 9d 84
6   90 d8 ab 00 8c bc d3 0a f7 e4 58 05 b8 b3 45 06
7   d0 2c 1e 8f ca 3f 0f 02 c1 af bd 03 01 13 8a 6b
8   3a 91 11 41 4f 67 dc ea 97 f2 cf ce f0 b4 e6 73
9   96 ac 74 22 e7 ad 35 85 e2 f9 37 e8 1c 75 df 6e
a   47 f1 1a 71 1d 29 c5 89 6f b7 62 0e aa 18 be 1b
b   fc 56 3e 4b c6 d2 79 20 9a db c0 fe 78 cd 5a f4
c   1f dd a8 33 88 07 c7 31 b1 12 10 59 27 80 ec 5f
d   60 51 7f a9 19 b5 4a 0d 2d e5 7a 9f 93 c9 9c ef
e   a0 e0 3b 4d ae 2a f5 b0 c8 eb bb 3c 83 53 99 61
f   17 2b 04 7e ba 77 d6 26 e1 69 14 63 55 21 0c 7d

Table 4.2. AES S-Box for Decryption
When decrypting, the rotation is reversed. The MixColumns step consists of the matrix multiplication M_e * A when encrypting, where M_e is the constant matrix:

\[
M_e = \begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\]

MixColumns uses the inverse of M_e when decrypting and consists of the matrix multiplication M_d * A, where M_d is the constant matrix:

\[
M_d = \begin{pmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{pmatrix}
\]
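The matrix multiplications in MixColumns are carried out on bytes treated as elements of the finite field GF(2^8) defined in FIPS 197, with XOR as addition. The following Python sketch, whose function names are assumptions made for illustration, applies M_e to a single column and then M_d to the result to show that the two matrices are inverses of each other.

```python
def xtime(a):
    """Multiply a byte by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


M_E = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

M_D = [[0x0E, 0x0B, 0x0D, 0x09],
       [0x09, 0x0E, 0x0B, 0x0D],
       [0x0D, 0x09, 0x0E, 0x0B],
       [0x0B, 0x0D, 0x09, 0x0E]]


def mix_column(matrix, column):
    """Multiply a 4x4 constant matrix by one 4-byte column of the state."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gmul(coeff, byte)  # addition in GF(2^8) is XOR
        out.append(acc)
    return out


mixed = mix_column(M_E, [0xDB, 0x13, 0x53, 0x45])
restored = mix_column(M_D, mixed)
print([hex(b) for b in mixed], [hex(b) for b in restored])
```

Since the coefficients are constants, each product gmul(c, x) depends on a single variable byte and can be precomputed as a 256-entry table, which is what makes a table-based formulation of the round possible.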
A faster implementation for environments with sufficient memory operates on 32-bit words and reduces the AES round function to four table lookups and four XORs. If A denotes a 4x4 matrix input to the round, a_{i,j} denotes the entry in the i-th row and j-th column of A, j + x is computed modulo 4, and T_k, for k = 0 to 3, are tables with 256 32-bit entries, the round function is reduced to the form:

(II)
\[
A'_j = T_0[a_{0,j}] \oplus T_1[a_{1,j+1}] \oplus T_2[a_{2,j+2}] \oplus T_3[a_{3,j+3}] \oplus \mathrm{RoundKey}_j
\]
where A'_j denotes the j-th column of the round's output, for j = 0 to 3. Refer to pages 58-59 of [19] for a complete description and the derivation of this version. The entries in the tables in (II) are concatenations of 1, 2, and 3 times the S-Box entries. This version is due to the fact that the order of the SubBytes and ShiftRows steps can be switched and the MixColumns step can be viewed as the linear combination of four column vectors, which is actually a linear combination of the S-Box entries.

AES's key schedule expands the key to eleven 128-bit round keys used for the AddRoundKey steps (10 rounds plus the initial AddRoundKey). Each round key is viewed as four 32-bit words. The key schedule creates the eleven round keys as an array of forty-four 32-bit words. The 128-bit key is split into four 32-bit words to form the first four array entries. Each remaining word is formed by XORing two previous words or by performing an S-Box lookup on a previous word then XORing it with a constant and a previous word. The following is pseudo code for the key schedule when using 128-bit keys. Refer to [56] for a general description of the key schedule that processes 128, 192 and 256-bit keys.

Notation:
• EK is the array of 32-bit words containing the expanded key.
• x is a word.
• concat(a,b,c,d) indicates the concatenation of the inputs to form the bitstring abcd.
• x <<< 8 means to rotate x to the left by 8 bits.
• SubWord(x) applies the S-Box used in the round function to each byte of the word x.
• C = [0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000, 0x1b000000, 0x36000000] is an array of constants, indexed from 0.

AES's Key Schedule for 128-bit Keys:
/* Place the 128-bit key in the first 4 entries of EK */
for (i=0; i < 4; ++i) {
    EK[i] = concat(K[4*i], K[4*i+1], K[4*i+2], K[4*i+3])
}
/* The first word of each remaining round key is formed from the XOR of
   an S-Box entry, a constant and a previous word. The second to fourth
   words of each remaining round key are the XOR of two previous words. */
for (i=4; i < 44; ++i) {
    x = EK[i-1]
    if (i mod 4 == 0) {
        x = SubWord((x <<< 8)) XOR C[i/4 - 1]
    }
    EK[i] = EK[i-4] XOR x
}
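The key schedule pseudo code can be exercised with the Python sketch below. To keep the block self-contained without repeating Table 4.1, the S-Box is derived from its definition in FIPS 197 (multiplicative inverse in GF(2^8) followed by an affine transformation); that derivation and all function names are choices made for this example rather than part of the implementation described in this chapter. The sample key is the 128-bit key used in the key expansion example of FIPS 197.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) using the AES reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the encryption S-Box: field inverse, then affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


C = [0x01000000, 0x02000000, 0x04000000, 0x08000000, 0x10000000,
     0x20000000, 0x40000000, 0x80000000, 0x1B000000, 0x36000000]


def sub_word(x, sbox):
    """Apply the S-Box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(sbox[b] for b in x.to_bytes(4, "big")), "big")


def rotl8(x):
    """Rotate a 32-bit word to the left by 8 bits."""
    return ((x << 8) | (x >> 24)) & 0xFFFFFFFF


def expand_key(key, sbox):
    """Expand a 16-byte key into forty-four 32-bit words."""
    ek = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        x = ek[i - 1]
        if i % 4 == 0:
            x = sub_word(rotl8(x), sbox) ^ C[i // 4 - 1]
        ek.append(ek[i - 4] ^ x)
    return ek


SBOX = build_sbox()
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
EK = expand_key(key, SBOX)
print([f"{w:08x}" for w in EK[4:8]])
```

The only conditional in the loop depends on the index i, not on key material, which is why the loop control can be left to the CPU without revealing anything about the expanded key.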
4.5.2 AES in OpenGL
This section explains how AES can be implemented in OpenGL. The AES round function cannot easily be implemented in OpenGL as the standard series of four steps. The SubBytes step can be performed using a colormap, the ShiftRows step can be performed by copying pixels to different locations and AddRoundKey can be performed by copying pixels containing the round key and pixels containing the data to the same location with XOR turned on. However, the MixColumns step would have to be expanded to a series of colormaps and copying of pixels to perform individual multiplications and additions due to the lack of a corresponding integer matrix multiplication with modular arithmetic in OpenGL.

The view of AES as four table lookups and XORs also cannot be implemented in OpenGL due to the lack of a suitable 32-bit data structure. While the RGBA format is 32 bits, it is not possible to use all 32 bits as an index into a colormap or to swap values between components, both of which would be necessary to implement the version in (II). Even though GPUs support 32-bit floating point values and basic arithmetic operations on them, floating point values are represented as a sign, mantissa and exponent. The IEEE 754 32-bit floating point format uses a 1-bit sign, 8-bit exponent and 23-bit mantissa. This prevents the 32-bit floating point value from being used to store data that is to be interpreted as a single sequence of 32 bits (such as by using an unsigned 32-bit integer). Rounding error also makes the use of floating point operations unsuitable for the exactness required by ciphers.

Since neither (I) nor (II) can be implemented directly in OpenGL, an intermediate step in the transformation of the standard algorithm in (I) to the version in (II) is used. Letting A'_j and a_{i,j} be defined as in (II) and letting S[a_{i,j}] denote
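the S-Box entry corresponding to a_{i,j}, the encryption round function for rounds 1 to 9 is represented as:

(III)
\[
A'_j =
\begin{pmatrix} 02 \cdot S[a_{0,j}] \\ 01 \cdot S[a_{0,j}] \\ 01 \cdot S[a_{0,j}] \\ 03 \cdot S[a_{0,j}] \end{pmatrix}
\oplus
\begin{pmatrix} 03 \cdot S[a_{1,j+1}] \\ 02 \cdot S[a_{1,j+1}] \\ 01 \cdot S[a_{1,j+1}] \\ 01 \cdot S[a_{1,j+1}] \end{pmatrix}
\oplus
\begin{pmatrix} 01 \cdot S[a_{2,j+2}] \\ 03 \cdot S[a_{2,j+2}] \\ 02 \cdot S[a_{2,j+2}] \\ 01 \cdot S[a_{2,j+2}] \end{pmatrix}
\oplus
\begin{pmatrix} 01 \cdot S[a_{3,j+3}] \\ 01 \cdot S[a_{3,j+3}] \\ 03 \cdot S[a_{3,j+3}] \\ 02 \cdot S[a_{3,j+3}] \end{pmatrix}
\oplus \mathrm{RoundKey}_j
\]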
If three tables, representing 1, 2, and 3 times the S-Box entries are stored, (III) reduces to a series of table lookups and XORs. This allows AES to be implemented using colormaps and copying of pixels. The 10th round is implemented as (III) with all the coefficients of 2 and 3 replaced by 1 because there is no MixColumns step in the 10th round. Since decryption uses the inverses of the S-Box and matrix multiplication, five tables need to be stored, representing 0E, 0B, 0D, 09 and 01 times the S-Box inverse. Notice that this representation of AES processes data as individual bytes, instead of 4-byte words. However, the manner in which the pixel components are utilized in the implementation when encrypting multiple blocks allows 4 bytes to be processed simultaneously per pixel, compensating for the performance loss due to not being able to use 32-bit words as in (II).

When creating the tables, GPUs' use of floating point values and rounding had to be taken into consideration. The S-Box entries and their multiples would ideally be stored as integers. However, they correspond to values for a color component of a pixel when implemented in a GPU. Even though the color components of a pixel are viewed as a set of bits when read to and from system memory, they are operated on as floating point values in the GPU. A colormap is represented as a table of floating point values in the range of 0 to 1. In integer format, the S-Box entries are integers in the range of 0 to 255. The multiples are computed with the finite field multiplication used in MixColumns, so every product also lies in the range of 0 to 255. When converting the multiples of the S-Box to colormaps, each entry is divided by 255 to convert the integers to the range of 0 to 1. Then 0.000001 is added to all values except 0 and 1. When a color component is replaced by the value from the colormap, this results in the color component being 0.000001 greater than the actual value when represented as a floating point.
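The effect of the 0.000001 offset can be modelled with a short Python sketch. The model is a deliberate simplification and rests on assumptions: a color component is treated as an IEEE 754 single-precision value, and the conversion back to an integer index is modelled as multiplying by 255 and truncating. Real hardware and drivers may differ in detail, so this only illustrates why a small positive offset keeps the truncated index on the intended integer.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def colormap_entry(value, offset=0.000001):
    """Convert a table entry in 0..255 to a color component in [0, 1]."""
    f = value / 255
    if value not in (0, 255):  # the offset is not added to 0 and 1
        f += offset
    return to_float32(f)


def component_to_index(component):
    """Assumed conversion: scale to 0..255, then truncate toward zero."""
    return math.floor(to_float32(component * 255))


round_trip = [component_to_index(colormap_entry(v)) for v in range(256)]
print(round_trip == list(range(256)))
```

Under this model every entry survives the conversion to a floating point component and back, because the stored value sits slightly above v/255 and truncation therefore cannot land on v - 1.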
The value of a color component is converted from a floating point to an integer in order to index into the colormap. When the value is converted to an integer, it will be converted to the nearest
Encryption in GPUs
55
integer less than or equal to the floating point value. Therefore, the addition of 0.000001 when populating the colormap entries avoids error due to rounding when converting a color component to an integer to index into the colormap. The floating point to integer conversion also occurs if the pixels are written to system memory after the block cipher has been applied, as is the case when the GPU is being used only to execute the block cipher, and the result is used outside of the GPU.
In the OpenGL version of AES, encryption was implemented as the following steps. The code is provided in Appendix A. The DATA, KEY, 01, 02 and 03 areas are the pixel areas shown in Figure 4.6 in Section 4.5.3.
Steps in the OpenGL Version of AES's Encryption Function:
    /* setup */
    Define static colormaps corresponding to 1, 2, 3 times the S-Box entries.
    main {
        Set mode for operations to be performed in the back buffer.
        Load the data into the DATA area.
        /* initial whitening */
        Load the expanded key into the KEY area.
        Turn the logical operation of XOR on.
        Copy the first key from the KEY area to the DATA area.
        Turn the logical operation XOR off.
        /* rounds 1 to 9 */
        for (i=0; i < 9; ++i) {
            /* Compute 1,2,3 times the S-Box entry for each byte */
            Copy the DATA area:
                to the 01 area with the colormap corresponding to 1*S-Box turned on
                to the 02 area with the colormap corresponding to 2*S-Box turned on
                to the 03 area with the colormap corresponding to 3*S-Box turned on
            Turn colormapping off.
            Copy the pixels from areas 01,02,03 corresponding to the first term on the right hand side of (III) to the DATA area.
            Turn the logical operation of XOR on.
            /* Compute the XOR of the first 4 terms in (III) */
            Copy the pixels from areas 01,02,03 corresponding to the second term on the right hand side of (III) to the DATA area.
            Copy the pixels from areas 01,02,03 corresponding to the third term on the right hand side of (III) to the DATA area.
            Copy the pixels from areas 01,02,03 corresponding to the fourth term on the right hand side of (III) to the DATA area.
            /* end of round whitening */
            Copy the ith round key from the KEY area to the DATA area.
            Turn the logical operation XOR off.
        } /* end of rounds 1 to 9 */
        /* last round */
        Copy the DATA area to the 01 area with the colormap corresponding to 1*S-Box turned on.
        Turn colormapping off.
        Copy the pixels from the 01 area back to the DATA area in the order corresponding to ShiftRows.
        /* final whitening */
        Turn the logical operation of XOR on.
        Copy the last round key from the KEY area to the DATA area.
        Turn the logical operation XOR off.
        /* Send the result to the display or to an application */
        Swap the data area to the front buffer to display it to the user or read the data area to system memory if the data is being encrypted for an application outside of the GPU.
    } /* end of main */
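The body of rounds 1 to 9 can be mirrored on the CPU to see how the three tables replace SubBytes, ShiftRows and MixColumns. This is a minimal sketch under stated assumptions, not the OpenGL code from Appendix A: the names `T1`, `T2`, `T3`, `build_tables` and `round_tables` are hypothetical, the state is assumed to be stored column by column, and the per-byte formulas are the standard AES MixColumns coefficients, which are assumed to match equation (III).

```c
#include <stdint.h>

uint8_t T1[256], T2[256], T3[256];   /* 1, 2 and 3 times the S-Box */

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Multiply by 2 in GF(2^8) with the AES reduction polynomial. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* Generate the S-Box (field inverse followed by the affine map), then
 * derive the 2x and 3x tables from it. */
void build_tables(void)
{
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ xtime(p));      /* p = p * 3 */
        q ^= (uint8_t)(q << 1);           /* q = q / 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        T1[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3)
                            ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    T1[0] = 0x63;                         /* 0 has no inverse */
    for (int i = 0; i < 256; ++i) {
        T2[i] = xtime(T1[i]);
        T3[i] = (uint8_t)(T2[i] ^ T1[i]);
    }
}

/* One of rounds 1 to 9. in[4*c + r] is row r of column c; in and out
 * must not overlap. rk is the round key in the same byte order. */
void round_tables(const uint8_t in[16], const uint8_t rk[16], uint8_t out[16])
{
    for (int c = 0; c < 4; ++c) {
        /* ShiftRows is folded into the indexing: row r is read from
         * column (c + r) mod 4. */
        uint8_t b0 = in[4 * c];
        uint8_t b1 = in[4 * ((c + 1) % 4) + 1];
        uint8_t b2 = in[4 * ((c + 2) % 4) + 2];
        uint8_t b3 = in[4 * ((c + 3) % 4) + 3];
        out[4 * c + 0] = (uint8_t)(T2[b0] ^ T3[b1] ^ T1[b2] ^ T1[b3] ^ rk[4 * c + 0]);
        out[4 * c + 1] = (uint8_t)(T1[b0] ^ T2[b1] ^ T3[b2] ^ T1[b3] ^ rk[4 * c + 1]);
        out[4 * c + 2] = (uint8_t)(T1[b0] ^ T1[b1] ^ T2[b2] ^ T3[b3] ^ rk[4 * c + 2]);
        out[4 * c + 3] = (uint8_t)(T3[b0] ^ T1[b1] ^ T1[b2] ^ T2[b3] ^ rk[4 * c + 3]);
    }
}
```

Each output byte is four table lookups and four XORs, which is what lets the GPU version express a round as colormapped copies followed by XOR-enabled copies.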
Decryption is implemented in the same manner, except the coefficients of the S-Box used in equation (III) are 09, 0B, 0D and 0E instead of 01, 02 and 03 to correspond to the matrix used in the MixColumns step during decryption. The column indices are computed using subtraction in place of addition to reverse the shift used for encryption (i.e., a_{i,j-i} instead of a_{i,j+i}). As mentioned in Chapter 2, since the OpenGL version of AES was implemented, GPUs that support 64-bit pixels with 16 bits per color component have become available. When using the 64-bit format, the OpenGL version of AES
will still be defined as in equation (III). However, now 2 bytes can be stored per color component and processed in parallel by defining the tables to contain entries corresponding to two identical 8-bit halves. This doubles the amount of data that can be processed in parallel.
AES's key schedule can be implemented in OpenGL using colormaps, XORs and copy commands on 32-bit pixels. In this scheme, a single color component of a series of pixels is used to store the bytes of the expanded key. For example, one word of the expanded key is stored in the red color component of 4 adjacent pixels. The rotation of the word x is performed by changing the location of the pixels using the copy command. The S-Box lookup is performed using a colormap as described for the OpenGL version of the encryption function. The array C can be stored in pixels. All XORs are performed by turning the logical operation of XOR on then copying the pixels involved in the XOR to the same location. The following is pseudo code for the OpenGL version of AES's key schedule for 128-bit keys. In the pseudo code, copying a 32-bit value means copying four consecutive pixels that contain the 32-bit value in an 8-bit color component. The conditional test for the for loop and the if statement can be executed in the CPU, with the body of the loop and if statement being OpenGL commands executed on the GPU because there is no information about the expanded key value conveyed in these two conditional statements.
Steps in the OpenGL Version of AES's Key Schedule:
    /* setup */
    Create a colormap corresponding to the byte-level substitutions performed in SubWords. This colormap is created using the S-Box values from the SubBytes step in encryption entered in a manner corresponding to the input being rotated to the left 8 bits. Notice that there is no need to perform the rotation during the execution of the key schedule because it is incorporated into the table entries.
    Write the values of the array C to a set of pixels in the framebuffer. Recall that the array C was defined in Section 4.5.1.
    Write the 128-bit key to a single 8-bit color component of 16 consecutive pixels in the framebuffer. This is EK[0] to EK[3].
    /* Expand key */
    for (i=4; i < 44; ++i) {
        copy EK[i-1] to a temporary area of pixels, tmp
        if (i mod 4 == 0) {
            turn the colormap on
            copy tmp (contains EK[i-1]) to itself
            turn the colormap off
            turn XOR on
            copy pixels containing C[i/4] to tmp
            turn XOR off
        }
        copy EK[i-4] to the location for EK[i]
        turn XOR on
        copy tmp to the location for EK[i]
        turn XOR off
    }
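The key schedule steps above have a direct CPU counterpart, which may help when checking the pixel-based version byte by byte. This is a minimal sketch, not the author's code: `expand_key_128` and `init_sbox` are hypothetical names, the array `C` is assumed to hold the standard AES round constants indexed from 1, and the rotation is done explicitly here rather than being folded into the table.

```c
#include <stdint.h>

static uint8_t sbox[256];

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Generate the AES S-Box (field inverse followed by the affine map). */
static void init_sbox(void)
{
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0x00)); /* p * 3 */
        q ^= (uint8_t)(q << 1);                                   /* q / 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3)
                              ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;
}

/* Assumed contents of the array C: the AES round constants. Index 0 is
 * unused because the loop reads C[i/4] only for i >= 4. */
static const uint8_t C[11] = {
    0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36
};

/* ek holds 44 words of 4 bytes: ek[4*i .. 4*i+3] corresponds to EK[i]. */
void expand_key_128(const uint8_t key[16], uint8_t ek[176])
{
    init_sbox();
    for (int b = 0; b < 16; ++b)
        ek[b] = key[b];                       /* EK[0] to EK[3] */
    for (int i = 4; i < 44; ++i) {
        uint8_t tmp[4];
        for (int b = 0; b < 4; ++b)
            tmp[b] = ek[4 * (i - 1) + b];     /* tmp = EK[i-1] */
        if (i % 4 == 0) {
            /* rotate left by one byte, substitute, then XOR in C[i/4] */
            uint8_t t0 = tmp[0];
            tmp[0] = (uint8_t)(sbox[tmp[1]] ^ C[i / 4]);
            tmp[1] = sbox[tmp[2]];
            tmp[2] = sbox[tmp[3]];
            tmp[3] = sbox[t0];
        }
        for (int b = 0; b < 4; ++b)           /* EK[i] = EK[i-4] XOR tmp */
            ek[4 * i + b] = (uint8_t)(ek[4 * (i - 4) + b] ^ tmp[b]);
    }
}
```

Only the loop counter and the test `i % 4 == 0` drive control flow, which is consistent with the observation that these conditionals reveal nothing about the key and can run on the CPU.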
4.5.3
AES Experiments
This section describes an implementation of AES's encryption and decryption functions for 128-bit blocks that works with any GPU supporting 32-bit pixels and OpenGL. The key schedule was not implemented inside the GPU since this experiment was measuring the encryption rate only. Instead, the expanded key was written to the GPU. The OpenGL version of the key schedule is described in Section 4.5.2. While the GPU allows for parallel processing of a large number of blocks, due to the simplicity with which AES can be implemented in software as a series of table lookups and XORs, the overall encryption rate using the GPUs tested is below the rate that can be obtained with a C implementation utilizing only system resources. However, the implementation demonstrates that encryption within the GPU is possible. Since the experiments were conducted, there has been an increase in the pixel size supported by GPUs. The increase will allow for two to four times as many data blocks to be encrypted simultaneously in the GPU compared to the quantities encrypted in the experiments.
The code consisted of C, OpenGL and GLUT. The C portion of the code sets up the plaintext and key. The OpenGL and GLUT commands are issued from within the C program. GLUT commands are used to open the display window. All of the encryption and decryption computations are performed with OpenGL functions, with data being stored and processed as pixels. The representation of AES defined in equation (III) in Section 4.5.2 was used. The implementation allows encrypting 4 * n blocks simultaneously, where n is the number of pixels utilized for the data being encrypted or decrypted and may be any integer less than the display's maximum pixel height supported by the GPU. The encryption of multiple blocks simultaneously from the same plaintext is useful if ECB or CTR mode is used. Alternatively, one block from several messages can be processed in parallel.
[Figure: layout diagram. Recoverable labels: n pixels (height), 16 pixels (width), EXPANDED KEY, work area.]
Figure 4.6. Layout of Data in Pixel Coordinates used in the OpenGL Version of AES
Figure 4.6 illustrates the pixel coordinates utilized by the algorithm. The initial data blocks are read into the 16 x n area starting at the origin, indicated by "DATA" in the diagram. One byte of data is stored in each pixel component, allowing us to process 4 * n blocks of data when all of the RGBA components are used. The ith column contains the ith byte of each block. This area is also used to store the output from each round. To maximize throughput, for each data block one copy of the expanded key is read into the area labeled "KEY" in the diagram. This area is 176 x n pixels starting at (17, 0) and the round keys are stored in order, each encompassing 16 columns. The tables are stored as colormaps and do not appear in the layout. The data stored in the first 16 columns is copied 3 times for encryption and 5 times for decryption, applying a colormap each time. The results are stored in the areas indicated by the hex values in the diagram and are computed per round. The values in parentheses indicate the location of the transformations for decryption. The hex value indicates the value by which the S-Box (or inverse S-Box, when decrypting) entries are multiplied. Figure 4.7 shows an example of the resulting display when the front buffer and RGB components are used to encrypt 300 identical data blocks simultaneously. Refer to Appendix A for the OpenGL code for encryption.
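The widths in this layout follow from the block size and the number of round keys. The following is a small sketch of that arithmetic; the function names are hypothetical, and the decomposition of the total width into DATA, KEY and work areas is an inference from the sizes given in the text (it reproduces the 272-pixel width quoted in Section 4.5.4 when five work areas are reserved for decryption).

```c
/* Sizes taken from the text: 16 bytes per block, 11 round keys for a
 * 128-bit key (initial whitening key plus 10 rounds). */
enum { BLOCK_BYTES = 16, ROUND_KEYS = 11 };

/* Width in pixels of the KEY area: one 16-column strip per round key. */
int key_width(void)
{
    return ROUND_KEYS * BLOCK_BYTES;
}

/* Assumed total width: DATA area + KEY area + one 16-column work area
 * per table (3 for encryption, 5 for decryption). */
int layout_width(int tables)
{
    return BLOCK_BYTES + key_width() + tables * BLOCK_BYTES;
}

/* Blocks processed at once: one block per pixel row per color component. */
int blocks_in_parallel(int n_rows, int components)
{
    return n_rows * components;
}
```

For example, `layout_width(5)` gives 272 and `blocks_in_parallel(n, 4)` gives the 4 * n blocks mentioned for RGBA.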
Two C implementations of AES are used for comparison. The first is the AES representation corresponding to variant (I) in Section 4.5.1, with the multiplication steps performed via table lookups, and reflects environments in which system resources for storing the tables required by variant (II) are not available. The second is a C implementation of variant (II) that offers increased encryption and decryption rates over (I) at the cost of requiring additional memory for tables. The code for (II) is a subset of [65]. The rate of encryption provided with the GPU is compared to that provided by the C implementation running on the system CPU.
Figure 4.7. Encryption of 300 Identical Blocks in RGB Components
For the implementations, OpenGL is the API to the graphics card driver. All of the programs use basic OpenGL commands and have been tested with OpenGL 1.4.0. No vendor-specific extensions are used, allowing the program to be independent of the GPU. The GPU must support 32-bit "true color" mode, because 8-bit color components are required for placing the data in pixels. At a minimum, one color component and at a maximum all four of the RGBA components are utilized by the programs. Since the experiments were conducted, OpenGL 2.0 and the OpenGL shading language were released, but
these offer no additional pixel processing functionality that is beneficial to the experiments. The implementation of AES can be set to work with one to four pixel components. When the result is intended for an application outside of the GPU, to avoid displaying the pixels to the window as the encryption is occurring, the display mode can be set to use a front and back buffer, with the rendering performed in the back buffer and the results read directly from the back buffer to system memory and never displayed on the screen. Otherwise, the result can be swapped to the front buffer to be displayed to the user. The support for the alpha component in the back buffer is optional in OpenGL; therefore, it may be necessary to perform rendering in the front buffer and display the pixels to the screen when utilizing all of the RGBA components.
All tests were performed in three different environments, then a subset of the tests were run in other environments to verify the correctness of the implementations with additional GPUs. The environments were selected to represent a fairly current computing environment (at the time the experiments were performed), a laptop and a low-end PC. Nvidia and ATI cards were used to illustrate that the implementations worked with different brands of cards. However, our intent was not to compare the performance of the different graphics cards. The three environments used for all tests are:
1 A Pentium IV 1.8 Ghz PC with 256KB RAM and an Nvidia GeForce3 Ti200 graphics card with 64MB of memory. The operating system is MS Windows XP.
2 A Pentium Centrino 1.3 Ghz laptop with 256KB RAM and an ATI Mobility Radeon 7500 graphics card with 32MB of memory. The operating system is MS Windows XP.
3 A Pentium III 800 Mhz PC with 256KB RAM and an Nvidia TNT2 M64 graphics card with 32MB of memory. The operating system is MS Windows.
In all cases, the display was set to use 32-bit true color and full hardware acceleration.
Aside from MS Windows and, in some cases, a CPU monitor, no programs other than those required for the experiment were running. The CPU usage averaged around 8% in each environment with only the operating system and CPU monitor running. All code was compiled with Visual C++ Version 6.0. The implementations required opening a display window, although computations may be performed in a buffer that is not visible on the screen. The window opened by the program is positioned such that it does not overlap with the window from which the program was executed and to which the output of the program is written. The reason for this positioning is that movement of the display window or overlap with another active window may result in a slight
decrease in performance and can interfere with the results. GLUT commands were used to open the display window. The other GPUs on which the programs were tested included an Intel 82845G/GL Graphics Controller on a 2.3 Ghz Pentium IV processor running MS Windows XP, and an Nvidia GeForce4 Ti 4200 on a Pentium III 1.4 Ghz processor running MS Windows 2000. The AES implementation was also tested using a GeForce3 Ti200 graphics card with 64MB of memory with X11 and Redhat Linux 7.3.
In order to determine configuration factors impacting performance, a series of initial tests were run with the OpenGL implementations of AES while holding the number of bytes encrypted constant. First, since the implementation required a GPU that was also being utilized by the display, the refresh rate for the display was varied, but that did not affect performance. Second, the screen area (not the number of pixels utilized for the cipher) was varied from 800x600 to 1600x1200. This also did not affect performance, and in the results cited for AES, the screen area was set to the minimum of 800x600 and the dimension that accommodated the number of pixels required by the test. Third, the use of a single buffer with the pixels displayed to the screen versus a front and back buffer with all work performed in the back buffer and not displayed to the screen was tested. Again, there was no change in the encryption rate. A fourth test was run to determine if there was any decrease in performance by using the GLUT or GLX libraries to handle the display. GLX is the X Window System extension to support OpenGL. In the test, two versions of the program were executed, one using GLUT and one using GLX with direct rendering, from a server with a Pentium III running Redhat Linux 7.3. There was no noticeable difference between the rates from the GLUT and GLX versions of the program.
When describing the results, AES-GL indicates the implementation using OpenGL and AES-C indicates the C implementations, with the specific variant from Section 4.5.1 indicated by I (the SubBytes, ShiftRows, MixColumns, AddRoundKey representation) and II (the table representation). The AES-C programs have a hard-coded key and single 128-bit block of data. The programs expand the key then loop through encrypting a single block of data, with the output from the previous iteration being encrypted each time. No data is written to files and the measurements exclude the key setup (which is common for all variants). The AES-GL program uses a hard-coded expanded key and one or four blocks of data in the cases when the red or RGBA pixel components are used, respectively. Both the key and data are read in n times to provide n copies. Similar to the AES-C programs, the AES-GL program loops through encrypting blocks of data, with the output from the previous iteration being encrypted each time. The times exclude reading in the initial data and key, and no data is read from or written to system memory during the loop. Trials were conducted with the values of n ranging from 100 to 600 in increments of 100.
The rates for values of n > 300 varied by less than 2% and the rates across all values of n varied by at most 8%. The results for AES-GL in Table 4.3 are the averages over n > 300 when a single pixel component and all of the RGBA pixel components are utilized.
PC and GPU                    AES-GL R    AES-GL RGBA    AES-C (I)    AES-C (II)
800Mhz Nvidia TNT2            184Kbps     732Kbps        1.68Mbps     30Mbps
1.3Ghz ATI Mobility Radeon    55Kbps      278.3Kbps      2.52Mbps     45Mbps
1.8Ghz Nvidia GeForce3        380Kbps     1.53Mbps       3.5Mbps      64Mbps
Table 4.3. Encryption Rates for AES
The layout of the pixels was chosen to simplify indexing while allowing for a few thousand blocks to be encrypted simultaneously. Since the layout does not utilize all of the available pixels, the number of blocks encrypted at once can be increased if the display area is utilized differently. For example, if the number of blocks is n^2, the layout can be altered such that the various segments are laid out in n x n areas instead of as columns. Performance recommendations for OpenGL include processing square regions of pixels as opposed to processing narrower rectangles [89]. A modification of the program that performed the same number of steps on square regions instead of the configuration shown in Figure 4.6 was also tested. There was no change in the encryption rate, most likely because the program appears to be CPU bound. Furthermore, using square areas makes indexing more difficult and requires the number of blocks to be a perfect square for optimal utilization of the available pixels.
With the two Nvidia graphics cards, AES-GL's encryption rate was just under 50% that of AES-C (I). However, when compared to AES-C (II), the AES-GL rate was 2.4% of the AES-C version. The ratio was lower in both cases when using the ATI Mobility Radeon graphics card, with the AES-GL encryption rate being 11% of AES-C (I)'s rate and less than 1% of AES-C (II)'s rate. To determine the factors affecting AES-GL's performance, additional tests were performed in which AES-GL and AES-C were run while monitoring system resources. When using either AES-C or AES-GL, the CPU utilization is 100% for the duration of the program. While we expect high CPU utilization for AES-C, the result is somewhat counter-intuitive for AES-GL. The CPU usage is due to the rate at which commands are being issued to the graphics card driver. Due to the simplicity with which AES is represented, a single OpenGL command resulted in one operation from AES being performed: either the table
lookup or the XORing of bytes. The time required of the GPU to execute a command resulted in no idle time for the CPU before the GPU was ready for the next command. The difference between the AES representations used by AES-GL and AES-C is not considered to be a factor. While the representation of AES used in AES-GL processes data as individual bytes instead of as the 32-bit words used in AES-C (II), even when excluding the processing of n pixels simultaneously the use of the RGBA components allows 4 bytes to be processed simultaneously per pixel, compensating for the loss of not being able to use 32-bit words when encrypting multiple data blocks in parallel. The disadvantage of using a representation of AES based entirely on table lookups and XORs in the GPU is the need to use colormaps and copying, which are two of the slowest operations in GPUs.
4.5.4
Use of Parallel Processing in Attacks
In the AES experiments, the same key was used for all data blocks. Instead of using n copies of the expanded key, n different expanded keys can be used. In this sense, the GPU provides a mechanism by which an adversary trying to decrypt a block of ciphertext without knowing the key can try n keys simultaneously. First, n keys would need to be expanded simultaneously by implementing the key schedule in the GPU. Then n copies of the ciphertext can be written to the GPU and simultaneously decrypted with the n keys. While the number of pixels in a GPU does not allow for an exhaustive search of all 2^128 keys to be feasible, the parallel processing can aid an adversary who knows part of the key and is trying to determine the remaining bits. If the height of a display is 1024 pixels and each color component is used, only 2^12 keys can be tried simultaneously when using one copy of the configuration shown in Figure 4.6. The configuration is 272 pixels wide. The width of typical display sizes allows for two to four copies of this configuration. The 128 bits per pixel support that has become available since the OpenGL version of AES was written will at most increase the number of keys that can be tried simultaneously by a factor of 4 when using all color components. Therefore, using a display that is 1024 pixels high and at least 1088 pixels wide (4 x 272), with each color component used and 128-bit pixels, allows for 2^16 16-byte blocks to be decrypted simultaneously, each with a different key.
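The key counts in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes one key per byte-sized slot in each pixel row (4 slots in a 32-bit pixel, 16 in a 128-bit pixel) multiplied by the number of side-by-side copies of the layout; the function names are hypothetical.

```c
/* Keys that can be tried at once: one per byte-sized slot in each pixel
 * row, times the number of copies of the Figure 4.6 configuration that
 * fit across the display. */
unsigned long keys_in_parallel(unsigned height, unsigned bytes_per_pixel,
                               unsigned copies)
{
    return (unsigned long)height * bytes_per_pixel * copies;
}

/* Base-2 logarithm of an exact power of two; returns -1 otherwise. */
int log2_exact(unsigned long v)
{
    int bits = 0;
    if (v == 0 || (v & (v - 1)) != 0)
        return -1;
    while (v > 1) {
        v >>= 1;
        ++bits;
    }
    return bits;
}
```

With a height of 1024, 32-bit pixels and one copy this gives 4096 = 2^12 keys; with 128-bit pixels and four copies it gives 65536 = 2^16, which still leaves almost all of a 128-bit key space out of reach.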
4.6 GPUs and Stream Ciphers
4.6.1 Overview
As a first step in evaluating the usefulness of GPUs for implementing cryptographic primitives, the mixing component of a stream cipher (the XOR operation) was performed inside the GPU. GPUs have the ability to XOR many
pixels simultaneously, which can be beneficial in stream cipher implementations. For applications that pre-compute segments of key streams, a segment can be stored in an array of bytes that is then read into the GPU's memory and treated as a collection of pixels. The data to be encrypted or decrypted are also stored in an array of bytes that is read into the same area of the GPU's memory as the key stream segment, with the logical operation of XOR enabled during the read. The data can be written to the GPU as it arrives, but doing so may result in a decrease in performance because multiple writes containing one or a few pixels are less efficient than issuing one write involving a large number of pixels. The result of the XOR is written to system memory if it is needed by another application or, in the case of a display being decrypted, is displayed to the user. Overall, XORing the data with the key stream requires two reads of data into the GPU from system memory and one read from the GPU to system memory. The number of bytes can be at most three times the number of pixels supported if the data is processed in a back buffer utilizing only RGB components. The number of bytes can be four times the number of pixels if the front buffer can be used or the back buffer supports the alpha component. If the key stream is not computed in the GPU, the cost of computing the key stream and temporarily storing it in an array is the same as in an implementation not utilizing a GPU.
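The mixing step and the capacity limits above can be modeled on the CPU as follows. This is only a sketch of the data flow, with hypothetical function names: `dst` stands in for the pixel area that already holds the key stream segment, and `src` for the data written over it with XOR enabled. In OpenGL 1.x the corresponding effect can be obtained by enabling `GL_COLOR_LOGIC_OP` with `glLogicOp(GL_XOR)` before drawing the pixels, but no OpenGL calls are made here.

```c
#include <stddef.h>
#include <stdint.h>

/* CPU-side model of the mixing step: XOR src into dst in place. */
void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] ^= src[i];
}

/* Bytes that fit in a y-by-y pixel area with the given number of 8-bit
 * color components per pixel (3 for RGB, 4 for RGBA). */
size_t segment_bytes(size_t y, size_t components)
{
    return y * y * components;
}
```

Applying `xor_into` a second time with the same key stream restores the original data, which is why the same sequence of operations serves for both encryption and decryption.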
4.6.2
Experiments
The rate at which data can be XOR'ed with a key stream in an OpenGL implementation is compared to that of a C implementation (Visual C++ 6.0). The tests were conducted using a PC with a 1.8Ghz Pentium IV processor and an Nvidia GeForce3 graphics card, a laptop with a 1.3Ghz Pentium Centrino Processor and an ATI Mobility Radeon graphics card, and a PC with an 800Mhz Pentium III Processor and an Nvidia TNT2 graphics card. These are the same environments used when testing the OpenGL version of AES. The results from the C implementation are in Table 4.4. Several data sizes were tested to determine the ranges for which the OpenGL implementation would be useful. As expected, the benefit of the GPU's simultaneous processing is diminished if the processed data is too small. Tables 4.5 and 4.6 indicate the average encryption rates over 10 trials of encrypting 1000 data segments of size 3y^2 and 4y^2, respectively, where the area of pixels is y by y. For the number of pixels involved in the images, the transfer rate to the GPU was measured to be equal to the transfer rate from the GPU, thus each read and write contributed equally to the overall time. Notice that the encryption rate was fairly constant for all data sizes on the slowest processor with the oldest GPU (Nvidia TNT2). Possible explanations include a slow memory controller, memory bus, or GPU, although this was not investigated further. With the GeForce3 Ti200 card, the efficiency increased as
CPU         1.8 Ghz     1.3 Ghz      800 Mhz
XOR Rate    139MB/s     93.9MB/s     56MB/s
Table 4.4. XOR Rate Using System Resources (CPU)
Using RGB Components
Area (in pixels)    Nvidia TNT2    ATI Mobility Radeon 7500    Nvidia GeForce3 Ti200
50x50               27.8MBps       23.5MBps                    35.7MBps
100x100             28.8MBps       38.5MBps                    53.4MBps
200x200             26.0MBps       45.5MBps                    64.5MBps
300x300             26.0MBps       45.0MBps                    70.1MBps
400x400             27.0MBps       43.0MBps                    75.4MBps
500x500             26.6MBps       38.0MBps                    77.3MBps
600x600             27.7MBps       41.7MBps                    81.2MBps
Table 4.5. XOR Rate Using GPUs - RGB Pixel Components
Using RGBA Components
Area (in pixels)    Nvidia TNT2    ATI Mobility Radeon 7500    Nvidia GeForce3 Ti200
50x50               37.0MBps       26.3MBps                    49.3MBps
100x100             38.4MBps       38.1MBps                    69.2MBps
200x200             32.0MBps       45.7MBps                    86.8MBps
300x300             32.0MBps       42.3MBps                    94.8MBps
400x400             32.8MBps       49.0MBps                    95.9MBps
500x500             32.6MBps       37.0MBps                    97.5MBps
600x600             32.8MBps       41.5MBps                    105.0MBps
Table 4.6. XOR Rate Using GPUs - RGBA Pixel Components
more bytes were XOR'ed simultaneously. On the laptop the peak rates were obtained with 200x200 to 400x400 square pixel areas. When using the RGB components, the highest rate obtained by the GPUs compared to the C program is 58% for the Nvidia GeForce3 Ti200 card, 48.5% for the ATI Mobility Radeon card, and 51.4% for the Nvidia TNT2 card. With both the GeForce3 Ti200 and the ATI Radeon cards, results with the 50x50 pixel area were significantly slower than with larger areas due to the time to read
data to/from system memory representing a larger portion of the total time. In both cases the rate is approximately 25% of that of the C program. When using the RGBA components, the highest rates on the Nvidia GeForce3 Ti200, ATI Radeon and Nvidia TNT2 cards are 75.5%, 52% and 68% of the C program, respectively.
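The percentages quoted here are ratios of the peak GPU rates in Tables 4.5 and 4.6 to the CPU rates in Table 4.4. A small sketch of that arithmetic, using a hypothetical helper name:

```c
/* GPU XOR rate expressed as a percentage of the C program's rate.
 * Both rates must be in the same unit (MB/s in Tables 4.4 to 4.6). */
double percent_of_cpu(double gpu_rate, double cpu_rate)
{
    return 100.0 * gpu_rate / cpu_rate;
}
```

For instance, the GeForce3 Ti200 peak with RGB components is `percent_of_cpu(81.2, 139.0)`, which is a little over 58, and with RGBA components `percent_of_cpu(105.0, 139.0)`, which is about 75.5.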
4.7
Conclusions
The AES experiments prove it is possible to perform encryption within a GPU while illustrating the difficulty in moving existing block ciphers into the GPU. The lessons learned from developing the OpenGL version of AES and from considering other symmetric key ciphers indicate GPUs are suitable for some, but not all symmetric key ciphers. GPUs can be used to offload a shared system CPU in applications using stream ciphers and which allow large segments of data to be combined with the key stream simultaneously. Asymmetric key ciphers are not implementable in existing GPUs, primarily due to the magnitude of the parameters involved. Since it is feasible to perform encryption within a GPU, encryption and decryption of graphical displays and images may be moved into the GPU to avoid temporarily storing an image as plaintext in system memory. A prototype of this concept is presented in the next chapter. The prototype demonstrates the use of GPU-based decryption to protect displays in two applications from malware on an untrusted system. As GPU processing power and capabilities continue to increase, the potential uses will also increase. Expanding accessibility to internal GPU functions by enhancing existing APIs may be beneficial, allowing applications to take advantage of capabilities currently not available through existing APIs.
Notes
1 RC4 was designed by Ronald Rivest of RSA Security. It was never officially published under the name RC4 but is commonly used and well analyzed. It was published as an IETF draft in 1999 under the name Arcfour. Refer to http://en.wikipedia.org/wiki/RC4/ for a brief history and definition.
2 The implementation of AES described in Section 4.5.2 and the experiments described in Section 4.5.3 were first presented in [17].
Chapter 5 REMOTELY KEYED CRYPTOGRAPHICS
5.1
Overview
Chapter 4 addressed the feasibility of performing encryption and decryption within a GPU. The ability to execute a cipher within a GPU by itself is insufficient to utilize GPU-based encryption when the intent is to protect the display from an untrusted client. The applications mentioned in Chapter 3 have other issues that must be addressed aside from how and where the display or images are decrypted. In this chapter, thin-client remote desktops and remote video conferencing applications are considered. Issues such as how to convey a cipher's key to the GPU and how to authenticate the entities involved are addressed. An architecture and prototype for these applications are described.
The general scenario under consideration is shown in Figures 5.1 and 5.2. The user wishes to view images (such as his/her desktop or video from a teleconference) on an untrusted client. The images are sent encrypted from a trusted server. The client may have malware running on it that can access data in the system's RAM and send it to an adversary. If the data is decrypted by a process running on the client's operating system, the malware can access the plaintext and send it to the adversary or write it to the system's hard disk for the adversary to retrieve later (Figure 5.1). When the GPU decrypts the data, the malware can only access the encrypted image and send the adversary ciphertext (Figure 5.2). In Chapter 6, related issues regarding protection of user input at the remote client are described and potential solutions proposed. An alternative method for keying the GPU is also provided.
5.2
Keying of GPUs
When using a GPU to encrypt data on a trusted system, the key can simply be entered into the system then written to the GPU. This is how the OpenGL
[Figure: the server sends images (example content: "Account 12345678 asset allocation") to the remote client; data from the server is decrypted on the client's OS; malware on the remote client's OS copies the data and sends it to the adversary.]
Figure 5.1. Malware on Untrusted Client with OS-based Decryption
version of AES is keyed both when the 128-bit key is given to the GPU to expand and when the already expanded key is given to the GPU. This method of keying the GPU is inadequate when the GPU that is performing the encryption or decryption is on an untrusted system. Instead, the key must be conveyed to the GPU without the operating system being able to gain access to the key. It is not feasible to hard code or otherwise embed a secret key within the GPU because the GPU is intended to be used with multiple servers and applications, each of which will use a distinct key and possibly multiple keys per session. Two possibilities for keying the GPU are to install a certificate in the GPU containing values that allow the use of public key-based protocols or to provide a method for a user to input a secret key into the GPU without the key passing through the operating system. The installation of a certificate in a GPU by the manufacturer is a reasonable assumption consistent with the information the Trusted Computing Group recommends be installed by software and hardware manufacturers as described in Chapter 6. If a certificate is installed in the GPU containing an RSA public-private key pair, as in the case of the prototype, a server or other entity can encrypt the secret
[Figure: the server sends images to the remote client; malware on the remote client's OS can't access the unencrypted data; data from the server is decrypted in the client's GPU (example content: "Account 12345678 asset allocation").]
Figure 5.2. Malware on Untrusted Client with GPU-based Decryption
key with the GPU's public key and send it to the operating system to write to the GPU. Assuming the GPU has the ability to perform public key cryptography, the GPU then decrypts the secret key. An alternative to the use of public key cryptography is to have the user click on colored blocks displayed on the GPU. The pixel values corresponding to the colors selected are used as the key. Recording the mouse clicks or keyboard input used to select the blocks provides no information to spyware if the pixel values cannot be read from the GPU by the spyware. An image of colored blocks can be stored in the GPU and displayed to the user in constant motion so even if the spyware read the image when it was loaded into the GPU, the spyware would not know the location of a specific block at the time of each selection. This idea for keying the GPU is discussed further in Chapter 6.
5.3 Prototype
5.3.1 Purpose
The purpose of the prototype is to illustrate the concept of CryptoGraphics for certain applications. The prototype represents how and where certain functions can be performed, and what issues need to be addressed. Explanations of the implementation decisions are included in order to convey the obstacles encountered (both GPU and non-GPU related) and the limitations that exist due to GPUs' capabilities in comparison to general processors, and to suggest possible enhancements to GPUs to facilitate GPU-based encryption. With regard to the decryption, the intent of the prototype is not to force existing ciphers to fit within a GPU in order to decrypt data, but rather to implement as many operations as possible within the GPU and confine the remaining ones to a C program in order to illustrate the concept of how GPU-based decryption can be used in a real application.
5.3.2 Architecture
Figure 5.3 depicts the overall architecture. A server encrypts the data and sends it to the client. The data remains encrypted until it enters the GPU where it is decrypted and displayed. The GPU's buffer is locked to prevent the display from being read by other processes or the operating system, effectively turning the framebuffer into a write-only memory. The decryption is performed via software (running on the client's operating system) that issues commands to the GPU (as opposed to a compiled program existing and executing entirely in the GPU's memory), with the operations performed within the GPU. This software does not have access to the keys and data contained inside the GPU; rather, it specifies the transformations (i.e., decryption steps) that the GPU must undertake by issuing OpenGL commands to the GPU. Ideally, any intermediate data produced by the decryption program, such as the key stream, are confined to the GPU. Section 5.4 explains why this is currently not possible due to the GPU's API and capabilities.
The decryption key changes on a per-session and application basis (and may even change within a session). The key must be conveyed to the GPU in a manner that prevents the client's operating system from gaining access to it and that allows the key to be set in real time. One way to achieve this is to remotely key the GPU and decrypt the key therein. Ideally, the key is used to generate the key stream directly within the GPU, exposing neither the key nor the key stream to the operating system. The decryption of the key and generation of the key stream can be performed in a non-visible buffer (back buffer) on the GPU, to avoid visually displaying the key and key stream. Reading the encrypted image into the back buffer with the logical operation of XOR enabled results in the image being decrypted. The result is then swapped to the front buffer
Figure 5.3. Architecture for Remotely Keyed Decryption in the GPU. (Figure annotations: the user inserts a smart card into the proxy (card reader); the client transmits the GPU's certificate to the proxy; the proxy sends the secret key to the GPU; a program on the client issues OpenGL commands to the GPU to generate the key stream.)
to display the decrypted image to the user. None of these operations require copying the image (plaintext) to the system's main memory.
There are a few possibilities for how the entities involved are authenticated and how the key is sent to the GPU, depending on which components are trusted. Three scenarios are described here. In each case, it is assumed that the GPU contains a pre-installed certificate and private key. The certificate must be placed in the GPU without exposing the private key to the operating system. This can be accomplished by having the certificate issued by the manufacturer and hardwired in the GPU. Another option is to allow writing the certificate to the GPU under circumstances when the client's operating system is trusted, such as when the GPU is first being installed on a newly configured client. In each of the three scenarios, if a secret key only works for n segments of data (such as n video frames), the remote keying will occur as needed to provide the key for each data segment.
The first and simplest option for authentication covers the case when the server sending the images is trusted and there is no need to verify the person viewing the images (i.e., it is assumed that the fact that the viewer was able to start the process on the client indicates it is safe to send the images) and/or the server is capable of authenticating a GPU based on its certificate. The server,
either by establishing a session key with the GPU or using the GPU's public key, encrypts the secret key and sends it to the GPU via the client.
The second scenario is more general. Like the first scenario, the server is trusted. Unlike the first scenario, verification of the user who is requesting to view the display or images is required. This verification is performed through the use of a proxy entity, such as a smart card reader. The user activates the proxy by inserting a card into the smart card reader attached to the untrusted system. The proxy then establishes sessions with both the server and the remote client containing the GPU. The server conveys the secret key to the GPU via the proxy, as shown in Figure 5.4. The process of converting the key from being encrypted under the server-proxy session key to being encrypted under the proxy-GPU session key requires that the key be exposed only on the smart card. Note that the proxy and the GPU treat the underlying system, including the operating system, as part of the network connecting them to each other and to the server, and that the links between these entities denote logical connections.
A third scenario assumes that neither the server's nor the client's operating system is trusted. When the images are encrypted, the encryption key is recorded on a smart card. The encrypted images can then be stored on any server. This scenario is not applicable to the real-time applications that are considered in this work. To view the images on an untrusted system, the smart card is inserted into a card reader (the proxy) or the key can be manually recorded and entered into the proxy. The proxy, using the GPU's public key, encrypts the secret key and sends it to the GPU via the client. The proxy does not have to be co-located with the client, but only has to be capable of exchanging information with the client.
The protocols used for the remote keying are not new. Refer to [31] and [46] for a discussion on authentication using smart cards.
The novel component of this work is implementing one in a manner that avoids exposing the secret key outside the GPU. Any protocol used for the remote keying requires utilizing an asymmetric encryption algorithm to either encrypt the secret key directly with the GPU's public key or to establish a session key that is then used to encrypt the secret key when sending it to the GPU. Obstacles arise due to the lack of support in GPU APIs for the operations required for public key ciphers, such as modular arithmetic for large integers, as mentioned in Chapter 4. Furthermore, the GPU's certificate must be placed in the GPU without exposing the private key to the operating system.
5.3.3 Implementation
To determine the feasibility of the scheme, we implemented the second, general scenario, in which the server is trusted and the user must be authenticated. Three entities are involved: a server, a proxy and the client. A stream cipher, RC4, is used to encrypt the images (as opposed to a block cipher) because of
the rate of encryption required for streaming video. RSA is used in the remote keying protocol. The prototype performs as many operations as possible in the GPU via OpenGL, with the remaining operations restricted to a C program; these would be moved into the GPU as GPUs evolve. Specifically, computation of the key stream cannot be efficiently implemented entirely in OpenGL for a cipher such as RC4. Small exponents were used for RSA to allow it to be implemented in the GPU in order to demonstrate what would occur if large integers were supported in GPUs. In the description of the prototype, the following notation is used:
• $K = \{k_1, k_2, \ldots, k_n\}$ is the set of secret keys used to encrypt the data; $k_i$ encrypts the $i$-th subset of data. These keys may be individually pre-determined, or computed through a master key using a pseudorandom function.
• A frame refers to one frame of video or one display update, depending on the application.
• Rekeying refers to obtaining the next $k_i$. The interval at which rekeying occurs depends on either the number of frames displayed or the elapsed time.
• $r$ is the number of frames or requests after which rekeying is required.
• $t$ is the amount of time before rekeying is required.
• $sk$ is the session key used for communication between the server and proxy.
• $k^{pub}$ is the GPU's public RSA key component.
• $k^{priv}$ is the GPU's private RSA key component.
• $m$ is the GPU's RSA modulus.
Figure 5.4 illustrates the steps for the remote keying and decryption of images in the prototype. The GPU has a certificate containing its RSA key stored in its memory. For the prototype, a program on the client uses OpenGL to write the certificate to the GPU, then deletes it from the operating system's memory to simulate having a certificate within the GPU.
Entering a certificate into the GPU in this manner requires that the process be monitored to ensure that no program on the client gains access to the private key component of the RSA key while it is being written to the GPU. The certificate includes a public parameter containing an indication that the device is a GPU. When the application is started, the client's operating system reads the public information from the GPU's certificate and sends it in a request to the proxy. The proxy, which requires activation either by entering a one-time password or inserting a smart card, authenticates the GPU based on the information encoded in its certificate.
Figure 5.4. Remotely Keyed Decryption in GPU Protocol: Logical links are shown (i.e., the proxy communicates with the server through the client). (Entities shown: server, proxy, client OS, GPU with framebuffer. Legible step labels: start application; 2: enter password or insert smart card; 3: OS reads public components of certificate; 5: session request; 8: the proxy decrypts $k_i$, which was encrypted under $sk$; the OS writes the RSA-encrypted $k_i$ to the GPU, which decrypts it and saves it for use in key stream generation; 11: ready for images or display update request; 12: images/display updates; 13: OS writes data from server to framebuffer; 14: write key stream to framebuffer and XOR with data.)
The client also sends a connection request to the server. The server contacts the proxy and a secure session is established between them. This can be accomplished using any protocol designed for secure session establishment. A single session key may be used for the entire session, or the session key can be changed periodically, depending on the protocol. In the prototype, the proxy authenticates the server based on the latter's certificate, and uses a single session key, $sk$. When contacting the proxy, the server sends a random nonce and its certificate containing its public key for RSA. The proxy generates a random nonce, encrypts it with the server's public key and sends it to the server. The server and proxy both concatenate the two nonces and use a hash of the result as $sk$. The server sends $k_i$ encrypted with AES using key $sk$ to the proxy. The proxy decrypts $k_i$, encrypts it with the GPU's public key and forwards the result, $k_i^{k^{pub}} \bmod m$, to the client. The client issues the OpenGL command to turn colormapping on, then writes the value received from the proxy to a specific pixel location in the GPU. The colormap corresponds to $x^{k^{priv}} \bmod m$, where
$x$ is the value being written, and results in decrypting the value from the proxy to obtain $k_i$. The write operation is performed to the GPU's back buffer to avoid visually exposing the resulting pixels (and to avoid disrupting the user with unnecessary interference). As explained later, a series of one-byte values is used for each $k_i$. The resulting pixels are used as the key to the stream cipher. The client then signals to the server that it is ready to receive data or, for thin-client applications, makes a request to update a display. The server sends the encrypted data to the client. Ideally, the GPU computes the key stream, writing the resulting bytes directly to the GPU's back buffer. As explained in Section 5.4.2, when using RC4 some C code is used to represent operations that will be performed in the GPU if enhancements are made to the GPU. The client issues the OpenGL command to turn the logical operation of XOR on in the GPU, then writes the data received to the back buffer. The result is the data XORed with the key stream. The buffers are then swapped so the unencrypted image appears on the display. It is common practice to create an image in the back buffer then swap it to the front buffer in order to create a smooth transition between frames. After $r$ frames or $t$ time, the client must signal to the server that it needs the next secret key, $k_{i+1}$, which is conveyed via the proxy as before.
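The XOR step described above can be mimicked on the CPU with a short C sketch. It is only a stand-in for what the text says happens in the GPU's back buffer: the buffer already holds the key stream, and the encrypted data is written into it with XOR enabled. The function name is an assumption; the source does not list the exact OpenGL calls used (in OpenGL 1.x this would presumably involve `glLogicOp(GL_XOR)` together with a pixel write, but that is not confirmed by the text).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * CPU stand-in for writing encrypted pixel bytes into a buffer that
 * already holds the key stream while a logical XOR operation is enabled.
 * On return, `buffer` holds (key stream XOR ciphertext), i.e. the plaintext.
 */
void write_with_xor(uint8_t *buffer, const uint8_t *incoming, size_t len)
{
    assert(buffer != NULL && incoming != NULL);
    for (size_t i = 0; i < len; i++)
        buffer[i] ^= incoming[i];
}
```

Since every byte is combined independently, the whole frame can in principle be processed in one write, which is why the decryption itself adds little cost compared with producing the key stream.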
Figure 5.5. Encrypted Image Received by GPU
Figure 5.6. Decrypted Image Displayed in GPU
The prototype uses images encoded with 24 bits per pixel using 8 bits for each of the red, green and blue components. No alpha component is encoded because the image is written to the back buffer (which may not support the alpha component) to be decrypted. The pixel format is a parameter used by certain OpenGL commands, such as the Draw command for writing data to the GPU, and can easily be changed to accommodate other pixel formats. Figure 5.5 shows an encrypted image received by the GPU and Figure 5.6 shows the decrypted result.
5.4 Design Decisions
In this section some design and implementation decisions made when creating the prototype are discussed. These decisions were guided by the constraints of existing GPUs. First, the limitations on programming a GPU to perform general keying and decryption operations are described, and the current inability to provide data compression is discussed. As mentioned in Chapter 2, GPUs are not designed to perform general modular arithmetic and byte-level operations. There are no API commands for common operations such as modular addition and multiplication, and byte-level shifts and rotations. Some operations can be performed by a sequence of other commands under certain circumstances, by limiting values to a single byte and/or by reading intermediate results from the GPU to the operating system to allow the result to be a parameter in a subsequent command. The following subsections describe how these limitations impact the ability to remotely key the GPU and decrypt data within the GPU, and the workarounds used to create the prototype.
A conclusion is that three enhancements to OpenGL and GPUs are necessary to fully realize the architecture. First, a mechanism for using the contents of a pixel (or pixel component) as a parameter to an OpenGL command without first reading the pixel value from the GPU is required for the remote keying and key stream generation. An example of this is RC4, when the index of the next S array entry to write to the key stream is computed (e.g., the step output(S[(S[i] + S[j]) mod 256]) from the pseudocode in Section 4.3). Second, the ability to perform modular arithmetic using values less than 256 directly is desirable to efficiently implement certain ciphers, such as RC4, within the GPU. Modular arithmetic can currently be performed on values contained within single color components using colormaps, provided one operand is constant so that a static colormap can be used; the map is then applied to the second operand. Third, support for large integers is needed. Modular arithmetic on values of the magnitude found in public key ciphers is needed to securely implement remote keying of GPUs. While it is feasible that modular arithmetic on integers may be directly supported in GPU APIs as GPUs evolve, it is unlikely that large integers, needed for public key ciphers, will be supported in GPUs anytime soon.
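For reference, the RC4 key stream generator mentioned in the first enhancement can be written in a few lines of C. This is the standard RC4 algorithm, not the prototype's actual source, and the type and function names are chosen here for illustration. The last line of `rc4_next` is the step quoted above: the output index is itself the sum of two table entries, which is the data-dependent lookup that cannot be expressed in OpenGL without first reading pixel values back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t S[256];
    uint8_t i, j;
} rc4_state;

/* Key scheduling: initialize S from a key of 1..256 bytes. */
void rc4_init(rc4_state *st, const uint8_t *key, size_t key_len)
{
    assert(st != NULL && key != NULL && key_len >= 1 && key_len <= 256);
    for (int n = 0; n < 256; n++)
        st->S[n] = (uint8_t)n;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {
        /* The cast to uint8_t performs the reduction mod 256. */
        j = (uint8_t)(j + st->S[n] + key[(size_t)n % key_len]);
        uint8_t tmp = st->S[n];
        st->S[n] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* Produce the next key stream byte. */
uint8_t rc4_next(rc4_state *st)
{
    st->i = (uint8_t)(st->i + 1);
    st->j = (uint8_t)(st->j + st->S[st->i]);
    uint8_t tmp = st->S[st->i];
    st->S[st->i] = st->S[st->j];
    st->S[st->j] = tmp;
    /* output(S[(S[i] + S[j]) mod 256]) */
    return st->S[(uint8_t)(st->S[st->i] + st->S[st->j])];
}
```

Every operation is either a byte addition modulo 256 or a swap, so the arithmetic fits in single color components; the obstacle is the indexing, not the value range.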
5.4.1 Remote Keying
The lack of modular arithmetic and limitations on the range of values in GPUs impacts the implementation of the asymmetric key cipher used in the remote keying. The proxy conveys the secret keys to the GPU via the client's operating system using an asymmetric key cipher. Since existing public-key algorithms require exponentiation and modular arithmetic on large integers, the operations required cannot be emulated in the GPU with existing APIs, except when trivially small values are used or when the values involved can be viewed as a series of smaller values. For example, the exponents and modulus in RSA must each fit within the bits of a single color component of a pixel, making them entirely unsuitable for a security application. The remote keying of the GPU requires only that the GPU be able to perform the decryption function of the asymmetric key cipher. Note that unless the proxy and GPU share a secret key in advance or the user can securely enter the secret key into the GPU when needed, any protocol used to exchange information, whether by merely having the proxy encrypt information with the GPU's public key or by establishing a session key between them, requires use of an asymmetric key cipher.
Two options were considered for the prototype. First, the operations can be implemented in C code to represent a function that should be in the GPU. Second, restrictions can be imposed on the size of the asymmetric key cipher's components to allow it to be implemented to run in the GPU. However, in the case of RSA this requires that plaintext and ciphertext each be restricted to fit within a single byte when using 32-bit pixels with one byte per color component, thus requiring that the modulus and exponents also each fit within a single byte and resulting in key components too small to be secure, since an exhaustive search for the private key and data is easily performed. In order to illustrate the concept of decryption using public key cryptography within the GPU, "toy" values less than 256 were used in the prototype for the private exponent, public exponent and modulus. The use of RSA with "toy" values will be referred to as mini-RSA. A series of 8-bit values was used to represent the data, specifically the secret key for RC4 in the prototype, encrypted with RSA. Each 8-bit value is encrypted with mini-RSA by the proxy and sent to the GPU, where the values are decrypted and used as the bytes of the RC4 secret key. When using RC4 as the key stream generator, up to 256 single-byte values can be in the series for RC4's secret key.
A third possibility that is worth exploring is the integration of a decrypting GPU with a trusted platform module (TPM) such as the one proposed by the Trusted Computing Group. The TPM could provide certificate storage and handling, as well as remote attestation and key negotiation. The GPU can then handle image decryption using the TPM-negotiated session key.
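A minimal C sketch of mini-RSA follows. The parameters (modulus 253 = 11 × 23, public exponent 3, private exponent 147) are toy values chosen here for illustration; the text does not state which values the prototype used. The 256-entry table built by `build_decrypt_map` plays the role of the colormap that maps a written value x to x^d mod m, so that decryption becomes a single table lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy parameters, all below 256 (assumed for illustration; insecure). */
enum {
    MINI_RSA_M = 253, /* modulus: 11 * 23 */
    MINI_RSA_E = 3,   /* public exponent */
    MINI_RSA_D = 147  /* private exponent: 3 * 147 = 441 = 2 * 220 + 1 */
};

/* Square-and-multiply modular exponentiation on small values. */
static uint8_t mod_pow(uint32_t base, uint32_t exp, uint32_t mod)
{
    assert(mod >= 1 && mod <= 256);
    uint32_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1u)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return (uint8_t)result;
}

/* Proxy side: encrypt one key byte (must be smaller than the modulus). */
uint8_t mini_rsa_encrypt(uint8_t k)
{
    assert(k < MINI_RSA_M);
    return mod_pow(k, MINI_RSA_E, MINI_RSA_M);
}

/* GPU-side stand-in: a 256-entry map with map[x] = x^d mod m. */
void build_decrypt_map(uint8_t map[256])
{
    assert(map != NULL);
    for (int x = 0; x < 256; x++)
        map[x] = mod_pow((uint32_t)x, MINI_RSA_D, MINI_RSA_M);
}
```

With these values each byte of the RC4 key must be below 253, and an attacker can recover the private exponent by trying at most a few hundred candidates, which is exactly the weakness the paragraph above points out.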
5.4.2 Decryption of Data in the GPU
To decrypt the images received from the server, the GPU on the client must run a symmetric key cipher; as described previously, a stream cipher was used. Two options for the stream cipher were considered: using an existing stream cipher and designing a stream cipher suitable for a GPU. With respect to running an existing stream cipher within a GPU, the operations found in stream ciphers make this infeasible, either because of the nature and number of OpenGL commands required to emulate the operations or because the operations cannot be converted to execute within the GPU given the limitations of the API. As explained in Section 4.3, existing stream ciphers, such as LILI, RC4, SEAL, SNOW and SOBER, are unsuitable for implementation in a GPU. Of these, RC4 was chosen because most of its operations can easily be implemented in OpenGL. However, it is not practical to do so because the specific OpenGL commands required result in poor performance. The use of irregularly clocked feedback shift registers in LILI and SOBER, and of 32-bit words in SNOW and SEAL, among other operations, results in these stream ciphers having a lower percentage of operations that can be implemented in OpenGL when compared to RC4.
The operations in RC4 consist entirely of adding two bytes modulo 256 and swapping two bytes. Thus, the only operation required of RC4 that is lacking in a GPU is modular arithmetic. Since the modulus is 256, all values can be represented by single bytes and can be stored as individual pixel components. Given two integers, a, b in the range [0,255], a + b mod 256 can be computed using a colormap. This requires knowing either a or b in advance to determine which colormap to activate. For each integer, a, in the range [0,255], create a colormap where the i-th entry corresponds to a + i mod 256. To compute a + b mod 256, b is stored as a pixel component, the colormap for a is activated, then the pixel containing b is copied to a new location. The result written to the new location will be the b-th entry of the colormap. This poses two problems. First, while OpenGL is used, the command to activate a colormap must be issued by a program running on the operating system, requiring a to be exposed to the operating system. While this does not expose the key stream to the operating system, it does provide partial information to the operating system, which may be helpful in determining key stream values. Second, the copying of pixels between locations in the buffer is one of the slowest operations within GPUs. In addition to the copy needed to compute the sum, copies are needed to update the indices and move bytes into the appropriate pixel components and locations. As a result, implementing RC4 in OpenGL was not a practical option at the time the prototype was developed. Therefore, the key stream generator of RC4 was implemented in C to represent a function that will eventually be moved into the GPU. The key stream bytes are written to the GPU as they are computed. This requires the C function computing the key stream to read the secret key from the GPU. Initially, each byte of RC4's output was written directly to the GPU as it was generated.
However, the number of writes required (750,000 for a 500x500 image) resulted in poor performance. The prototype was changed to compute the key stream bytes for an entire row of pixels before writing them to the GPU, reducing the number of writes to the height of the image with the tradeoff that a segment of the key stream is temporarily stored in the operating system's memory. Due to the inability to efficiently generate a key stream within a GPU by using an existing stream cipher, a possibility is to design a stream cipher utilizing graphics operations for which GPUs are designed. This is described in Chapter 7. While creation of a new stream cipher suitable for current GPUs is feasible (and in fact may have wider applicability than the GPU-based encryption applications), the same is not true for asymmetric key ciphers, since this would require devising a new one-way function that does not depend on the hardness of factoring or of discrete logs due to the need to avoid exponentiation and modular arithmetic on large numbers. While the proposed approach protects the secrecy of the images sent to the untrusted system, the integrity of these images is not protected. This could allow
an attacker to change parts of the image, although changing large portions of the image would be immediately detectable by the user, as it would produce corrupt output on the screen (since the attacker does not know the session key for the stream cipher). If single pixels or small areas of pixels are replaced, the alteration is not likely to be noticed by a user, but small, unnoticeable changes will also not be a useful attack because the meaning of the display or image seen by the user is not altered. Adding a message authentication code (MAC) to the scheme is technically feasible if the rate at which frames must be updated does not matter. A MAC is typically computed using a block cipher (such as the CBC-MAC) or with a hash function (HMAC). Whether or not the required operations can be performed in the GPU depends on the specific block cipher or hash function used. Since AES and the CBC mode of encryption can be performed in a GPU, at least the CBC-MAC using AES can be computed on the image. However, if a MAC is computed in the GPU on every frame in streaming video, this will noticeably degrade the rate at which frames can be displayed to the user, because small groups of pixels will be treated as blocks of data that have to be processed serially to compute the MAC, in contrast to the decryption step, which allows XORing all of a frame's pixels simultaneously with a segment of the key stream. If the display updates are small or there can be a slight delay before the update is visible to the user, as in the case of thin-client applications, then it may be possible to compute the MAC before displaying the update to the user. The GPU will have to be programmed to present an indication that a display update failed authentication for cases where the computed MAC does not match the value sent with the update.
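The colormap technique for modular addition described earlier in this section can be illustrated with a small C sketch. A 256-entry array stands in for the colormap, and an array lookup stands in for copying a pixel component through the active map; both function names are assumptions made for illustration, not OpenGL calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry map so that map[i] = (a + i) mod 256. */
void build_add_colormap(uint8_t a, uint8_t map[256])
{
    assert(map != NULL);
    for (int i = 0; i < 256; i++)
        map[i] = (uint8_t)(a + i); /* the cast reduces mod 256 */
}

/* "Copy" the pixel component b through the active colormap. */
uint8_t copy_through_colormap(const uint8_t map[256], uint8_t b)
{
    assert(map != NULL);
    return map[b];
}
```

For example, with the map for a = 200 active, copying b = 100 yields 44, which is (200 + 100) mod 256. The sketch also makes the first problem visible: the caller has to know a in order to select the map, so in the real setting a is exposed to the program issuing the commands.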
5.5 Experiments
To determine the feasibility of the architecture, two sets of experiments were conducted to measure the ability of current GPUs to sustain decryption rates compatible with the example applications. OpenGL was used as the API to the graphics card driver. No vendor-specific OpenGL extensions were used, making the prototype GPU independent. GLUT was used to open the display window. The only requirement is that the GPU must support 32-bit "true color" mode, as the routine for decrypting the secret key requires representing bytes in a single pixel component. The code for the client consists of C, OpenGL and GLUT, compiled using Visual C++ version 6.0. The processes for the server and proxy are written in Java, using version 1.4.2_03 with the Java Cryptography Extension. The experiments utilized three different clients in order to test different GPUs. The environments were selected to represent a fairly current computing environment (at the time the experiments were performed), a laptop and a low-end GPU. In all cases, the display was set to use 32-bit true color with full hardware acceleration. The clients are:
1. A Pentium IV 1.8 GHz PC with 256KB RAM and an Nvidia GeForce3 Ti200 graphics card with 64MB of memory, running MS Windows XP. The GPU driver uses OpenGL version 1.4.0.
2. A Pentium Centrino 1.3 GHz laptop with 256KB RAM and an ATI Mobility Radeon 7500 graphics card with 32MB of memory, running MS Windows XP. The GPU driver uses OpenGL version 1.3.425.
3. A Pentium III 800 MHz PC with 256KB RAM and an Nvidia TNT2 M64 graphics card with 32MB of memory, running MS Windows 98. The GPU driver uses OpenGL version 1.4.0.
Streaming video applications, such as NetMeeting, were simulated by sending a stream of images from the server to the client. Tests were performed with frame sizes of 320x240 and 500x500 pixels. The frames were encrypted and stored in individual files on the server prior to starting the application. A small number of unique frames were created and the server repeatedly cycled through the set. This was done to provide a steady stream of images and avoid any delay in encrypting images on the server from impacting the measurements, which were focused on the client's performance. To measure thin-client performance, the average update size of 2,112 pixels (a 16x132 pixel area) was used. The average is from the distribution of update sizes in the standard i-Bench [86] web benchmark for thin clients. The update sizes in i-Bench range from 1x1 areas to 1,007x622 areas (626,354 pixels). All tests used images encoded as 24-bit RGB pixels, with 8 bits per color component.
For each image size, two types of tests were run. The first set of tests determined the delay due to the additional computation needed for the remote keying and decryption, compared to sending unencrypted images. In these tests, all three entities (server, proxy, and GPU) were run on the same PC or laptop. Each of the three clients was tested. The results of the first set of tests are shown in Figure 5.7.
The second set of tests involved running each entity on separate systems on a local area network (LAN) to determine the overall performance when the data arrival rate was impacted by network delay. The first client with the Nvidia GeForce3 GPU was used for these tests. Figures 5.8 and 5.9 show the results of these experiments. Two tests were run using two different LANs. In one case, the server and proxy were dedicated to the experiment and there was no traffic leaving the server and proxy aside from that due to the experiment. In the second case, the tests were run on shared servers used for general purpose computing. In both cases, each element had a 100Mbps connection to the LAN. There were three hops between the client and server, and between the client and proxy; there are two hops from the proxy to the server. For all tests, the number of frames per second for both encrypted and unencrypted frames are provided. In video conferencing applications, the number
Figure 5.7. Decryption Rates: All Entities on a Single System. (Bar chart; x-axis: frame size in pixels, with groups for 16x132, 320x240 and 500x500; bars A: client 1 unencrypted, B: client 1 encrypted, C: client 2 unencrypted, D: client 2 encrypted, E: client 3 unencrypted, F: client 3 encrypted.)
of frames supported per second is important: a minimum rate of 10 fps is required to obtain tolerable video and is typical in such applications, with 24 fps and higher rates required for better quality. In contrast, the rate of updates in thin-client applications is dependent on user requests and will be sporadic. The frames per second reflects the maximum burst rate supported. The intention of the experiments was not to build a robust streaming video application using the Real-Time Transport Protocol (RTP), which accounts for delay, rate of transmission and lost packets. Rather, the focus was to determine the feasibility of remote keying and decryption within the GPU, and to measure the resulting overhead. Therefore, TCP was used for all communication between the entities. When testing streaming images over the LAN, it was necessary for the client to signal the server when it was ready for the next frame to avoid synchronization problems. At least 99% of the delay when decrypting frames with RC4, compared to using unencrypted images, is due to the writing of the key stream bytes to the GPU. The key stream was written to the GPU one row at a time. When the test is run with the write eliminated (all other operations for the decryption
Figure 5.8. Decryption Rates: Dedicated LAN and Client 1. (Bar chart; x-axis: frame size in pixels, with groups for 16x132, 320x240 and 500x500; y-axis ticks from 10 to 90; legend: A: client 1 unencrypted, B: client 1 encrypted.)
are still performed), the average time is the same as that for the unencrypted images. The actual computation of the key stream per frame, enabling the logical operation of XOR in the GPU and swapping of buffers take less than 1 ms for the 500x500 frames on all clients. When testing the average thin-client display size update (2,112 pixels), the times for the encrypted updates were the same as for the unencrypted updates because the key stream required only 16 writes to the GPU. In contrast, the 320x240 and 500x500 pixel frames required 240 and 500 writes per frame, respectively. The limiting factor in the processing of the 2,112-pixel updates is the time for the server to create the update (read the update from a file in the experiment). To determine the rate at which the client can process 2,112-pixel updates if creation of updates is not a limiting factor, an array containing 2,112 pixels was stored in memory on the server and repeatedly sent to the client. The server and client were running on the same system to eliminate network delays and bandwidth restrictions. The client can process over 500 updates per second on each of the three platforms, indicating that decryption overhead and the GPU
Figure 5.9. Decryption Rates: Shared LAN and Client 2. (Bar chart; x-axis: frame size in pixels, with groups for 16x132, 320x240 and 500x500; y-axis ticks from 10 to 90; legend: A: client 1 unencrypted, B: client 1 encrypted.)
are not limiting factors for small updates. For larger updates in thin-client applications, an increased delay, e.g., when the entire display changes, is not considered to be an issue since such updates are typically infrequent and, from a human factors perspective, are no worse than loading of some web pages or opening of applications. When sending images over a LAN, the decreased rate for the 320x240 and 500x500 pixel frames compared to the case when all processes were on the same PC is due to the rate at which images are sent from the server to the client being limited by the bandwidth. Even if no bandwidth is consumed by protocols, a maximum of 16.66 uncompressed 500x500 RGB frames can be transmitted per second on a 100 Mbps interface. The time for the remote keying is mainly dependent on the time to enter the password or insert the smart card into the proxy. This may take a few seconds if a password is entered. Aside from this, the time is dependent on the protocol used and on the transport delay between the entities. Using a public key encryption algorithm (RSA), generating random nonces and encrypting the
secret key with AES added approximately two seconds to the processing in each environment.
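The bandwidth ceiling quoted above for 500x500 frames follows from simple arithmetic. The sketch below is a back-of-the-envelope check rather than part of the prototype; it assumes 3 bytes per RGB pixel and a link that carries nothing but pixel data, and the function name is illustrative.

```python
def max_uncompressed_fps(pixels, link_bps, bytes_per_pixel=3):
    """Upper bound on uncompressed frames per second over a link.

    Assumes every bit of the link carries pixel data (no protocol overhead).
    """
    bits_per_frame = pixels * bytes_per_pixel * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    # Frame sizes used in the experiments, on a 100 Mbps interface.
    for label, pixels in [("16x132", 16 * 132),
                          ("320x240", 320 * 240),
                          ("500x500", 500 * 500)]:
        fps = max_uncompressed_fps(pixels, 100_000_000)
        print(f"{label}: at most {fps:.2f} frames per second")
```

For the 500x500 case each frame is 6,000,000 bits, so the bound is 100,000,000 / 6,000,000, or roughly 16.67 frames per second, which the text truncates to 16.66.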
5.6
Conclusions
The prototype addresses the feasibility of decrypting images and displays within a GPU as a way of combating the rising threat of spyware. The primary insight is that a suitably modified GPU can serve as a minimally trusted computing base for displays in certain types of widely used applications, such as video conferencing and remote desktop display access. The main mechanism in the scheme is decryption of frames exclusively inside the GPU, without storing either the key material or the plaintext on the system's main memory. The technique can protect against many types of spyware, as well as several attacks aimed at the human interface layer [44]. It was explained why this scheme cannot fully be realized due to current limitations of GPUs. Enhancements needed to GPUs to overcome these limitations were identified. The prototype demonstrates that the concept of GPU-based decryption is feasible for thin-client applications and the video broadcast in conferencing applications. To further improve performance when decrypting video in the GPU, image compression facilities will need to be implemented inside the GPU, a trend which is already occurring. In addition, the performance numbers show that for typical video conferencing frame rates, and web browsing and remote desktop access using thin clients, the lack of compression is not a bottleneck for the performance of the system.
Notes 1 The architecture and experiments for the remote keying of GPUs presented in Sections 5.3.2 and 5.5 were first presented in [16].
Chapter 6 RELATED ISSUES
6.1
Overview
In this chapter, topics related to the architecture and prototype presented in Chapter 5 are discussed. The architecture described in Chapter 5 focuses on securing images sent to an untrusted client. A complete system must also address the protection of user input on the client that is sent to the server and the protection of audio sent to the client. In addition, an alternative method for keying the GPU is provided. The architecture's susceptibility to man-in-the-middle attacks and phishing attacks is evaluated. The concept of executing cryptographic operations within a GPU can be used in conjunction with the trusted platform module (TPM) defined by the Trusted Computing Group (TCG). An overview of the TPM is provided and how the prototype can utilize the TPM is described. Another issue is where data compression is performed. Compression is unrelated to attacks against the client, but is impacted by moving encryption and decryption into the GPU.
6.2
Protecting User Input
The user responses on the untrusted client pose an interesting problem in that they require preventing input from the keyboard and mouse from being available to the untrusted operating system. One potential solution is to encrypt the keyboard inputs inside the keyboard itself (e.g., on the keyboard's USB controller). This requires a trusted keyboard, which is possible by using a portable folding keyboard that connects to a USB port, such as those available for several PDA devices. The mouse may be directly connected to the keyboard (e.g., a TrackPoint device, as is common with laptops), or input may only be taken from the keyboard. A pin can be used as the key to the cipher used for encrypting the inputs. The pin can be of sufficient length to thwart a brute force
attack. The server may either choose a pin for the user, displaying it securely to the user via GPU-based decryption, or have the user select a pin from a keypad displayed on the GPU. If the server selects the pin for use in the keyboard, the server merely sends the pin as an encrypted image to the client's GPU where it is decrypted and presented to the user who then enters it into the keyboard. The pin can be a relatively small, unpredictable area of the image. An attacker or malware attempting to modify the pin will at best have access to the encrypted image. Other possibilities include the use of graphical passwords [20,83] and shoulder-surfing-resistant PIN-entry methods [69]. Another option for conveying user input to an application on an untrusted client is the method described in [48] in which a trusted channel between a PDA (a cell phone) and the application requiring the input is used. The user's PDA provides a trusted device by which the user enters input. Graphically displayed keypads are used on some websites to allow a user to enter a pin to access his or her account by selecting values via mouse clicks. In some implementations the ordering of the values on the keypad changes after each mouse click. Variations of such displays can be used to set the pin for the keyboard and to provide a secret key to the GPU for use in a symmetric key cipher. The user can select a pin if the server displays a keypad to the user via the client's GPU. The keypad is sent encrypted from the server to the GPU where it is decrypted and displayed to the user. Then the user selects characters from the keypad by clicking on or entering a series of squares from the keypad, with the coordinates of the selections sent to the server. Even though the client's operating system can see the coordinates of the user's selections (since the keyboard and mouse inputs are not yet encrypted), it does not have access to the unencrypted keypad, making this information useless to an attacker.
To avoid guessing attacks based on the relative locations of the mouse pointer, the keypad configuration is changed every time a digit is selected as shown in Figure 6.1. The keypad can be spread across the display with each digit displayed in an arbitrary location determined by the server as shown on the right side of Figure 6.1 instead of in the traditional rectangle form. If an attacker or malware on the client attempts to alter the coordinates sent to the server, the altered values may not correspond to valid positions on the keypad. Minimizing the area of the display corresponding to digits will decrease the probability that malware can select coordinates that correspond to digits.
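The server-side bookkeeping for such a keypad can be sketched as follows. This is a minimal illustration, not the prototype's code: the grid dimensions, the function names and the use of a seeded `random.Random` are assumptions made for a repeatable demo, and a real server would draw layouts from a cryptographically secure source such as `random.SystemRandom`. Rendering and encrypting the keypad image are not modeled.

```python
import random

LABELS = [str(d) for d in range(10)] + ["enter", "clear"]


def new_layout(rng, cols=8, rows=6, cell=40):
    """Place each key in its own randomly chosen grid cell.

    Returns {label: (x, y, width, height)}. Most cells stay empty, so
    coordinates chosen blindly by malware are unlikely to hit a key.
    """
    cells = rng.sample(range(cols * rows), len(LABELS))
    return {
        label: ((c % cols) * cell, (c // cols) * cell, cell, cell)
        for label, c in zip(LABELS, cells)
    }


def resolve_click(layout, x, y):
    """Map click coordinates to a key label, or None if no key was hit."""
    for label, (kx, ky, w, h) in layout.items():
        if kx <= x < kx + w and ky <= y < ky + h:
            return label
    return None


def user_click(layout, label):
    """Simulate a user who sees the decrypted keypad and clicks a key."""
    x, y, w, h = layout[label]
    return x + w // 2, y + h // 2


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed for a repeatable demo only
    entered = []
    for digit in "4071":
        layout = new_layout(rng)  # fresh layout for every selection
        entered.append(resolve_click(layout, *user_click(layout, digit)))
    print("".join(entered))
```

Because a new layout is generated for every selection, the coordinates observed by the client's operating system for one digit say nothing about where that digit will appear next.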
6.3
Keying the GPU
The idea of a user clicking on a keypad displayed on the screen can be used to convey the secret key used for decryption in the GPU in place of the remote keying protocol. If the key used for encrypting data changes after a certain
Figure 6.1. Graphical Keypad for Digits: Each time the user selects a digit the keypad changes to prevent malware from associating coordinates with a specific digit. (Two panels: keypad when entering first digit; keypad when entering second digit.)
number of frames, the keypad can be used to enter a session key that the server and GPU will use to establish the keys for encryption. If one key is used for encryption during the entire session, the user can enter it via the keypad. When the user is inputting the key, colored shapes can be displayed on the screen in place of traditional ASCII characters. The byte level representation of the color corresponds to key bytes. As the user clicks on shapes, the GPU copies the pixel value from the selected coordinate to the area of its memory where the key is stored. The positions of the shapes and how the color values are assigned to them will vary each time a key is entered or each time a selection is made to avoid the possibility of malware on the operating system recording the values, which can occur if the display is static and an adversary manually programs the display information into the malware. There is a human factors issue with using colors in place of ASCII characters, namely that a series of colors is more difficult to remember and distinguish (especially between shades that differ only in one bit) than characters. While the GPU needs the pixel value, the hex value of the pixel can be displayed within the shape to assist the user. Unlike an alphanumeric pin, the secret key for a cipher can take on any value and typically consists of 128 or 256 bits. Entering a large segment of the key at once, such as by having the user select four 32-bit values, is infeasible
Figure 6.2. Graphical Keypad for Hex Values: The user enters a key by selecting the shape containing the hex value corresponding to the next four key bits. Color coding can be used to assist the user in locating a value. In this example, odd values are shaded using the blue color component with the other colors set to 0. Some shapes may not contain a value. Clicking on a shape will trigger the GPU to write a pixel whose color represents the hex value to the framebuffer. The GPU can be programmed to wait for multiple clicks before determining the pixel value.
because of the number (2^32) of colors that will need to be displayed to the user and a user's inability to distinguish and locate individual values from such a large number. In order to limit the number of colors that must be displayed, only one color component may be used (for example, have an 8-bit red value set and the green, blue and alpha components held constant as all 0's so they won't influence the color displayed). This still requires 256 colors be displayed to the user and will be difficult for a user to search through even with the hex values indicated if they are in a random order. The green or blue color component can be used for some values instead of red to make it easier for a user to find specific values. For example, representing all values with the least significant bit set to 1 as a shade of red and all other values as a shade of blue. Fewer colors can be used to make it easier for the user to locate the correct value on the screen by increasing the number of values the user must select.
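The trade-off between the number of values shown at once and the number of selections the user must make can be tabulated directly. The following is a small illustrative calculation for a 128-bit key; the function name is hypothetical.

```python
def keypad_tradeoff(key_bits, bits_per_selection):
    """Return (values to display, selections needed) for one keypad design."""
    if key_bits % bits_per_selection:
        raise ValueError("selection width must divide the key length")
    return 2 ** bits_per_selection, key_bits // bits_per_selection


if __name__ == "__main__":
    for bits in (32, 8, 4):
        shown, clicks = keypad_tradeoff(128, bits)
        print(f"{bits:2d} bits per selection: "
              f"{shown} values shown, {clicks} selections")
```

Halving the width of each selection doubles the number of selections but takes the square root of the number of values the user has to search through.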
Instead of the user selecting one byte at a time, only 16 values can be displayed to the user by using only the lower four bits. Now the user will enter the key as a series of 4-bit values instead of 8-bit values. Refer to Figure 6.2. GPUs can easily be programmed to fill in a pixel value based on a mouse click. The GPU can wait for multiple mouse clicks before populating the entire pixel value when the user is entering only 4 bits at a time. An alternative for how the GPU handles the 4-bit inputs is to alternate which 4 bits of a color component are used when creating the keypad. Then the pixels from two sequential selections can be XORed together to produce an 8-bit value that is stored in a single color component. For example, assume an 8-bit red pixel component is being used. Let kb refer to a pixel location that will store the first byte of the key the user enters. Display the hex values in the range of 0x00 to 0x0f to the user in shapes colored with pixels whose blue and green components are set to 0, and the red component takes on the values 0x00 to 0x0f. Write the pixel value from the shape the user selects to kb. Then change the display so the hex values are now displayed using red values that are 0x00, 0x10, 0x20 ... 0xf0. Turn the logical operation of XOR on and write the pixel value from the shape the user selects to kb. If the user selected "7" and "4" sequentially from the display, the resulting pixel value in kb has a red component of 0x47, and blue and green components of 0.
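The two-step entry of one key byte can be modeled on the CPU as follows. This is only a sketch of the arithmetic: pixels are represented as (red, green, blue) tuples, the XOR logical operation of the framebuffer is imitated by a helper function, and all names are illustrative.

```python
def red_pixel(value):
    """RGB pixel with only the red component set, as an (r, g, b) tuple."""
    return (value & 0xFF, 0, 0)


def xor_write(dest, src):
    """Model a framebuffer write with the XOR logical operation enabled."""
    return tuple(d ^ s for d, s in zip(dest, src))


def enter_key_byte(first_choice, second_choice):
    """Combine two 4-bit keypad selections into one key byte.

    The first keypad colors its shapes with red values 0x00..0x0f; the
    second uses 0x00, 0x10, ..., 0xf0, so the second selection supplies
    the upper four bits when it is XORed into the stored pixel.
    """
    kb = red_pixel(first_choice & 0x0F)                # plain write
    second = red_pixel((second_choice & 0x0F) << 4)    # shifted palette
    return xor_write(kb, second)                       # XOR write


if __name__ == "__main__":
    r, g, b = enter_key_byte(7, 4)
    print(f"red=0x{r:02x} green={g} blue={b}")
```

Selecting "7" and then "4" yields a red component of 0x47 with green and blue left at 0, matching the example in the text.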
6.4
Attacks
The scheme for remotely keying the GPU using the proxy as described in Chapter 5 is susceptible to a man-in-the-middle attack. This is because the proxy, server, and client are assumed to communicate over an untrusted network that includes the client's operating system, making it possible for an attacker to perform a man-in-the-middle attack using another system (a client #2 that has a GPU with a valid certificate) to perform the key exchange with the proxy device. Let client #1 refer to the client where the user wishes to see the display. Refer to Figure 5.4 for the protocol. Client #2, pretending to be client #1, sends the certificate to the proxy. The proxy may either be connected directly to client #1, such as by a USB port, or communicating with client #1 over a LAN or WAN. In the first case, malware running on the client will have to serve as an intermediate entity (i.e., a proxy, although this term is not used here to avoid confusion with the proxy involved in the protocol) in the communication between the proxy and client #2. Malware running on client #1 intercepts the certificate from client #1's GPU so it is not sent to the proxy (which instead receives client #2's certificate). The proxy, server and client #2 then complete the key establishment protocol, at which point client #2 will have the secret key needed to decrypt the displays. In addition, client #2 impersonates the proxy in communication with the GPU and establishes the same secret key with client #1's GPU. Client #1
and the server establish communication per the protocol, with client #1 sending the session request and request for images or display updates, and the server sending the images or display updates to client #1. The malware on client #1 copies the information received from the server and sends it to client #2 where it is decrypted and provided to the attacker. The images or display updates are also written to client #1's GPU as normally would occur so the user is unaware that the information has been copied and decrypted elsewhere. This attack is feasible because the proxy cannot verify that the GPU with which it is exchanging information resides on client #1. A possible solution is the use of packet leashes [33] in the context of the communication between the proxy and the GPU. A packet leash involves including an identification tag in each packet that allows the receiving entity to determine where the packet originated. However, this would place additional requirements on both the GPU and smart card, and increase their costs. The attack is not applicable when the GPU is keyed by the user selecting from a keypad as described in Section 6.3. Phishing attacks involving the redirection of web page requests are more difficult to perform in the architecture used in the remotely keyed GPU prototype. This is because, without a man-in-the-middle attack as described above, the phishing must be performed at the server, which will be referred to as server #1. Consider what happens if the phishing attack redirects requests from the client intended for server #1 to a web server, referred to as server #2, containing a fake web site. The display the user sees and the user's responses are normally encrypted. If the user's responses are redirected from the client, server #2 will not be able to decrypt them.
If server #2 attempts to send a web page or other display to the client, the client's GPU will attempt to decrypt it even though it is already plaintext and present a meaningless display to the user. Instead, the phishing attack must be able to redirect requests from the server intended for a valid web site to a fake web site provided by server #2. Server #1 will encrypt any web pages received from server #2 and send them to the client's GPU, where they will be properly decrypted and displayed to the user. Any information the user enters on a fake web page will be conveyed to server #2 via server #1. If server #1 is assumed to be trusted, in order to perform the redirection, the attacker must perform the redirection in the network, such as by false DNS entries, as opposed to methods that require access to server #1, such as modifying the host file or running malware on the server. Phishing attacks that operate by sending email to the user in an attempt to get the user to click on a url contained within the email for a fake web site can continue to work even if all displays the user sees pass through server #1 first to be encrypted. In a thin-client scenario where a user is reading email, server #1 will send display updates to the client reflecting the contents of the email. If the user clicks on a url in the email, server #1 will receive the request from the user, retrieve the web page from server #2, then send an encrypted display update
corresponding to the web page to the client where it is decrypted in the GPU. Any information the user enters on the web page will be returned to server #2 via server #1.
6.5
Trusted Platform Module
It is worth considering how the use of a trusted GPU for protecting displays can be incorporated into or utilize the architecture defined by the TCG. The core component of the TCG's architecture is the TPM. First, an overview of the TPM is provided. Second, how the GPU-based decryption can utilize the TCG architecture is discussed. The TPM is a microcontroller that provides key generation and a mechanism for authentication of the platform or its components. Specifications of the TPM are available in [84]. The TPM functions include key storage, key generation and digital certificate storage. The intent is that the keys stored and generated by the TPM are more secure from both physical and software attacks than if this information was stored outside the TPM. It also stores digests of certain system measurements, referred to as integrity measurements, in its platform configuration registers (PCRs). The TPM contains a random number generator for use in key and nonce generation, support for RSA, HMAC and a hash function (the current specification indicates SHA-1). It may also contain code for measuring platform devices, but this function is allowed to be located outside the TPM if necessary for implementation reasons. Hardware implementations of TPMs must be tamper resistant and nonremovable. Hardware versions should be attached to the motherboard of the PC. Software implementations are required to have a level of tamper resistance equivalent to that of hardware implementations, although no recommendations or suggestions for how to obtain this are included in the TCG specifications. The presence of a TPM does not guarantee the security or safeness of all software executing on a platform because the TPM does not control what software can run or report the status of running software. Software does not have to be certified to run on a platform with a TPM and detection of most threats by applications is left to the operating system. 
When a system containing a TPM is started, the TPM will go through a self initialization phase then take the integrity measurements, but does not analyze these measurements. It is not the function of the TPM to detect potential threats from these measurements. Any analysis and action are left to components outside the TPM, such as the operating system: "The operating system program loader is the next logical soft component to measure a program prior to loading it. Since the operating system helps enforce system integrity, it is reasonable for the program loader to both measure and enforce policies describing unacceptable software configuration state." [84], page 25. An application can contain a policy defining what it considers to be a trusted platform configuration. The application may not
execute or may limit the permitted interaction with the platform if the platform fails to meet the policy. The TPM can be deactivated for a short period by an "operator" to allow interactions with the platform without the TPM. The TPM must have various credentials or certificates installed on it, and full realization of the TCG's architecture requires credentials installed in all system components and software. The TPM must be delivered with an endorsement credential embedded in it by the manufacturer that indicates the manufacturer, part number, TPM version number and that contains an endorsement public key, referred to as the endorsement key (EK). The EK is a unique value that identifies the TPM. Conformance credentials are added by an evaluator. These contain information about the platform and TPM manufacturers. A platform credential contains information about the manufacturer. An attestation identity credential (AIK) that has the private key used to sign PCR values and the AIK public key is also present in the TPM. The AIK also contains references or pointers to the manufacturer information in the other credentials. Each credential is signed by its creator. Manufacturers of individual components, such as the mouse, keyboard, adapters, GPU and software, can include a validation credential in their products that contains information about the manufacturer, some measured values and possibly a list of capabilities of the product. However, a validation credential is currently not required for a product to run on a platform with a TPM. Notice that the TPM requires the installation of several credentials, including a key unique to the TPM. The prototype in Chapter 5 requires a certificate be installed on the GPU, which is less difficult to implement than the credentials defined by the TCG. In fact, if GPU manufacturers fully participate in the TCG architecture, GPUs will have validation credentials added to them, at which point a certificate can be included. 
However, with the presence of a TPM and a GPU that utilizes the software interface to it, the TPM can generate the RSA keys required by the GPU for the remote keying protocol. Note that if the user can enter the key into the GPU via the keypad method, the RSA keys are not needed. The TCG architecture can be used to attest to whether or not the GPU has been modified, which provides a mechanism for a user to determine whether or not a GPU on a remote client can be trusted. This currently requires that the operating system itself be trusted (validated), which is not assumed when using the GPU-based decryption. If all software installed on a system must be validated and/or the operating system can be validated and does not allow copying of data by software not supplying credentials, this will significantly reduce the chance of spyware copying the display data if it is decrypted by a process running on the operating system. However, a shared, publicly available client is unlikely to have all software on it validated if the client's purpose is to serve as a general purpose client because not all software, especially freeware, will adhere to the TCG architecture. Specifically, there will likely be software
without credentials that users will still want to use (and have valid reasons to run) on the client. In this case, GPU-based decryption is beneficial.
6.6
Data Compression
Traditionally, remote display and video conferencing systems have made extensive use of data compression in order to maximize network utilization and allow use of the application in bandwidth-limited environments. In most cases data compression is handled outside the GPU. Encrypted data, which ideally is pseudorandom bits, is not compressible. Any application involving encrypting data on a server and sending it to a client must compress then encrypt the data on the server. The client will decrypt the data then decompress it. Performing compression and decompression outside the GPU is not a concern when we are only trying to leverage the GPU's processing power as a cryptographic co-processor, since the data is returned to whatever application running on the CPU utilizes the data with no need to hide the plaintext from the operating system. However, when the goal is to protect the plaintext from spyware on an untrusted system, decrypting display data in a GPU serves no purpose if the data then needs to be read back to the untrusted system for decompression. A straightforward solution would be to add hardware decompression abilities to the GPU. This could be accomplished by using widely available data decoding chips, such as MPEG hardware decoders; indeed, several DVD-ready GPUs contain such logic already. An alternative approach, in particular for thin-client scenarios, would be to tailor the display protocol and its compression to use operations available in the GPU. More recent thin-client systems have proposed remote display protocols that employ different types of commands and compression algorithms for different kinds of display updates [74]. The advantage of this approach derives from the characteristics of the protocol commands that provide inherent compression, negating the need for additional, specialized compression algorithms.
For example, a command that instructs the client to fill a rectangular region with a particular color consumes very little bandwidth while compressing a potentially large region of the screen. E.g., draw(10,20,50,50) 0xef557777 to draw a 50 by 50 pixel rectangle with the lower left corner at (x,y) coordinates (10,20) and fill with the color 0xef557777, compared to sending 2500 32-bit pixel values of 0xef557777. Execution of such a command is clearly within the operations available in existing GPUs. By appropriately designing the remote display protocol to utilize similar operations, it is possible to improve the architecture to consume reasonable bandwidth without compromising security.
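The saving from such a command can be made concrete with a short sketch. The wire format below (four 16-bit fields followed by a 32-bit color, big-endian) is an assumption chosen for illustration and is not taken from any particular thin-client protocol; the framebuffer is a plain list of rows with row 0 treated as the bottom row.

```python
import struct

FILL_FORMAT = ">HHHHI"  # hypothetical layout: x, y, width, height, color


def encode_fill(x, y, width, height, color):
    """Pack a fill-rectangle command into bytes."""
    return struct.pack(FILL_FORMAT, x, y, width, height, color)


def encode_raw(width, height, color):
    """Pack the same region as one 32-bit value per pixel."""
    return struct.pack(">I", color) * (width * height)


def apply_fill(framebuffer, command):
    """Execute a fill command on a framebuffer stored as a list of rows."""
    x, y, width, height, color = struct.unpack(FILL_FORMAT, command)
    for row in range(y, y + height):
        for col in range(x, x + width):
            framebuffer[row][col] = color


if __name__ == "__main__":
    command = encode_fill(10, 20, 50, 50, 0xEF557777)
    raw = encode_raw(50, 50, 0xEF557777)
    print(len(command), "bytes as a command versus", len(raw), "bytes raw")
```

Under this assumed encoding the command occupies 12 bytes, while the 2500 raw pixel values occupy 10,000 bytes.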
Chapter 7 EXTENSIONS
7.1
Overview
In this chapter extensions to the work described earlier are presented. The first topic is the design of a symmetric key cipher for use in a GPU. An overview of how a stream cipher may be created using graphics operations is presented. The second topic is the protection of audio from malware on an untrusted client through DSP-based encryption.
7.2
Graphics-based Cipher
A GPU-based cipher would not only be beneficial to thin-client applications and remote video display discussed in previous chapters, but also serve as a general purpose cipher in any system containing a GPU. By mapping a texture exhibiting sufficient randomness to a continuously morphing image while changing certain variables, such as viewpoint and lighting, and extracting pixels from the image, a key stream is generated. The key stream is never within the client's memory in this case unless it is read from the GPU for use in an application running on the CPU. Experiments with an initial version were performed in order to estimate the time to compute the key stream. The first step is to generate the initial texture. Given a secret key, an initial pseudorandom texture can be generated that will serve as the seed texture for the GPU-based stream cipher. An existing stream cipher or random-bit generator (using the secret key as the seed) can be run to create enough bits to fill the targeted display size. The output is converted to pixels and used as the texture. A second option is to encrypt an image or any data of sufficient length using an existing block cipher and use the resulting ciphertext as the texture. In both options, the texture can be computed in advance then treated as if it was the secret key or the texture generation can be viewed as the first step in the GPU-based stream cipher. For applications using the GPU to generate and apply a key stream to data used in other applications (i.e., not for encrypting or decrypting displays), the time to generate the initial texture may be less of an issue than in applications involving real time display updates. When decryption in real time is required, the initial texture should be generated before the user expects the first image to avoid any perceived delay in the time to display the first image. Once the initial texture is generated, it is mapped to one or more three-dimensional objects whose surface encompasses at least the entire viewing area. The objects are then manipulated. The goal is to find a series of operations which produce pseudorandom pixels. The entire resulting image does not need to be pseudorandom; instead a subset of pixels from it can be added to the key stream after each iteration of the steps. Obviously, the larger the number of pixels which can be extracted after each iteration, the faster the key stream is generated. If neighboring pixels are extracted for the key stream, dithering must be disabled when generating the image. In order for the same pixel values to appear in the same location on multiple GPUs running the stream cipher with the same key, the pixel size and resolution of the display must be identical in the GPUs. Rounding must also be considered if the key stream is to be reproduced on different graphics cards. For example, if a vertex program is used that alters the location of an object's vertices as part of the manipulation, the resulting coordinates may be impacted by rounding. The coordinates will not be exact if an equation can produce a coordinate with a fractional part before rounding. If a vertex ends up with (x,y,z) coordinates of (100.3000,100.4999,0), the pixel at the location with (x,y) coordinates of (100,100) will contain a pixel that is the color of the vertex in the object.
Slight differences in precision between GPUs can result in small discrepancies in the images that are undetectable by the human eye but that result in some pixel values differing between the images generated in different GPUs. For example, if the y coordinate becomes 100.5000 instead of 100.4999, the pixel at location (100,101) on the display will be impacted instead of the one at (100,100). If the GPU truncates values instead of rounding them, the values of 100.999 and 101 will produce a coordinate value of 100 and 101, respectively. One idea that can eliminate the effects of vertices or shapes differing in location slightly due to rounding is to have a texture which involves blocks of different colors. The colors are pseudorandom. After an iteration of the steps manipulating the objects, pixels from the center of blocks at certain locations are added to the key stream. To estimate the time required for computing a key stream designed for the GPU, an initial image of a cube was loaded into the GPU with a random texture. The texture was pixels formed from bytes generated from the RC4 stream cipher. The cube was rotated, and its position, orientation and the angle from which it was viewed were altered. The lighting and fog settings were also changed. The time to execute all of the OpenGL operations under consideration was
measured. After each series of executions, the resulting image is the key stream and is XORed with the current encrypted frame. The execution per frame is less than 1 ms, indicating that any differences in the time to process encrypted frames versus the time to process unencrypted frames will be imperceptible to the user. In proposing to design a new stream cipher suitable for executing within GPUs, it must be ensured that the cipher can also be efficiently implemented on the server for the cases where a server is encrypting data before sending it to a client which uses a GPU for decryption. If the encryption algorithm is such that it must run in a GPU, the server can encrypt the update by writing the image to its GPU and reading the result; otherwise, the server can perform the encryption in its operating system. In video conferencing applications, the images being encrypted may appear on the monitor of the speaker and can be encrypted in the GPU before being sent to the server or other conference participants.
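The final step, XORing a frame-sized key stream with the encrypted frame, can be modeled on the CPU as below. This sketch does not reproduce the OpenGL manipulations; as a stand-in it takes the key stream directly from RC4, the cipher used in the experiment to fill the initial texture. The function names and the example key are illustrative.

```python
def rc4_keystream(key, length):
    """Return `length` bytes of RC4 key stream for `key` (a bytes object)."""
    s = list(range(256))
    j = 0
    for i in range(256):                      # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for _ in range(length):                   # output generation
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def xor_frame(frame, keystream):
    """XOR a frame with a key stream; the same call encrypts and decrypts."""
    if len(keystream) < len(frame):
        raise ValueError("key stream shorter than frame")
    return bytes(f ^ k for f, k in zip(frame, keystream))


if __name__ == "__main__":
    width, height = 132, 16                   # 2,112-pixel update, RGB
    frame = bytes(range(256)) * (width * height * 3 // 256)
    stream = rc4_keystream(b"example key", len(frame))
    encrypted = xor_frame(frame, stream)
    assert xor_frame(encrypted, stream) == frame
    print("round trip ok for", len(frame), "bytes")
```

Because XOR is its own inverse, the server and the client run the same operation; only the source of the key stream differs between a CPU and a GPU implementation.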
7.3
Encryption within DSPs
Performing encryption within a GPU exemplifies the concept of performing cryptographic operations outside the CPU. This concept can be extended to audio by using programmable digital signal processors (DSPs). In addition to images, video conferencing and certain remote desktop applications exchange audio between the server and remote clients, or between clients. Audio can be encrypted and decrypted in the DSP so that the CPU on the client only has access to the encrypted audio stream. Implementing encryption within a programmable DSP is significantly easier than implementing GPU-based encryption because of the operations that programmable DSPs support. Texas Instruments (TI) programmable DSPs, such as the TMS320C55X series, include a CPU and up to 16MB of memory. The operations supported include the typical byte-level operations found in symmetric key ciphers. Bytes can be processed within the DSP as they would be within the system's CPU; as a result, there is no need to derive alternate representations of existing symmetric key ciphers or to devise a new cipher to work in the DSP. Software development kits (SDKs) assist in moving encryption into a programmable DSP. In some SDKs, such as the SDK for the TI TMS320C55X series, code can be written in C or C++ and converted into assembly language for the DSP, as opposed to programming directly in the DSP's assembly language. Programming directly in assembly language may be needed to fully optimize the program. Public key ciphers involving large integers are not directly supported in programmable DSPs because large integers themselves are not supported. For example, there is no equivalent of the C/C++ GMP library or Java BigInteger in programmable DSPs. Therefore, conveying a secret key for use in a symmetric key cipher to a DSP via a protocol using public key encryption is an issue.
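As an illustration of how an existing byte-oriented cipher carries over unchanged to plain C, the kind of code a DSP toolchain accepts, the sketch below encrypts a buffer of audio bytes with RC4. RC4 is chosen only because it is short and is already mentioned in this chapter; it is not a recommendation, and the type and function names (rc4_state, rc4_init, rc4_crypt) are hypothetical rather than taken from any DSP SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RC4 state: a 256-byte permutation and two byte-sized indices. */
typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} rc4_state;

/* Key scheduling: mix the key into the permutation. */
void rc4_init(rc4_state *st, const uint8_t *key, size_t key_len)
{
    assert(key_len > 0 && key_len <= 256);
    for (int n = 0; n < 256; ++n)
        st->s[n] = (uint8_t)n;
    uint8_t j = 0;
    for (int n = 0; n < 256; ++n) {
        j = (uint8_t)(j + st->s[n] + key[(size_t)n % key_len]);
        uint8_t t = st->s[n];
        st->s[n] = st->s[j];
        st->s[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* XOR the buffer with the key stream in place. Encryption and decryption
 * are the same operation; only byte additions, swaps and XORs are used. */
void rc4_crypt(rc4_state *st, uint8_t *buf, size_t len)
{
    for (size_t n = 0; n < len; ++n) {
        st->i = (uint8_t)(st->i + 1);
        st->j = (uint8_t)(st->j + st->s[st->i]);
        uint8_t t = st->s[st->i];
        st->s[st->i] = st->s[st->j];
        st->s[st->j] = t;
        buf[n] ^= st->s[(uint8_t)(st->s[st->i] + st->s[st->j])];
    }
}
```

In a DSP setting, buf would be the block of audio samples handed to the cipher before the samples leave the device; how samples are buffered is device specific and is not shown here.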
An alternative to using a remote keying protocol with DSPs is to convey the key via audio. The user could speak the key (assuming no one is within range to eavesdrop), although speech may be difficult to convert into precise key bits given the variations in the human voice and the requirement that the logic for dealing with those variations fit into the DSP. A more realistic option is to play a series of tones using a PDA, which has substantially less variation than the human voice. This is similar to the idea of a user clicking on shapes to convey a key directly to the GPU; now the input is audio. Compared to the direct keying of the GPU via mouse clicks, this method increases the potential for an adversary in close proximity to the client to determine the key, because the key can now be recorded by a hidden device in the vicinity of the client instead of requiring that the adversary see the user's keypad selections.
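One way the tone-based keying could be turned into key bits is sketched below. Everything about the tone plan is an assumption made for illustration: sixteen tones spaced 100 Hz apart starting at 1000 Hz, one tone per 4-bit value, and a tolerance of 30 Hz. The sketch also assumes that the frequency of each tone has already been measured; tone detection itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tone plan: 16 tones, 100 Hz apart, starting at 1000 Hz. */
#define TONE_BASE_HZ 1000
#define TONE_STEP_HZ 100
#define TONE_TOL_HZ  30    /* accepted deviation from a nominal tone */

/* Map a measured frequency to a 4-bit value, or return -1 if it is not
 * close enough to any nominal tone. */
int tone_to_nibble(int freq_hz)
{
    int offset = freq_hz - TONE_BASE_HZ;
    if (offset < -TONE_TOL_HZ)
        return -1;
    int k = (offset + TONE_STEP_HZ / 2) / TONE_STEP_HZ;   /* nearest tone */
    int err = offset - k * TONE_STEP_HZ;
    if (k > 15 || err > TONE_TOL_HZ || err < -TONE_TOL_HZ)
        return -1;
    return k;
}

/* Pack 2*key_len detected tones into key_len key bytes, high nibble first.
 * Returns 0 on success, -1 if any tone is rejected. */
int key_from_tones(const int *freqs_hz, size_t n_tones,
                   uint8_t *key, size_t key_len)
{
    assert(n_tones == 2 * key_len);
    for (size_t i = 0; i < key_len; ++i) {
        int hi = tone_to_nibble(freqs_hz[2 * i]);
        int lo = tone_to_nibble(freqs_hz[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        key[i] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

Rejecting a tone that falls between two nominal frequencies, rather than guessing, keeps a noisy transmission from silently producing a wrong key; the user would simply replay the sequence.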
Chapter 8 CONCLUSIONS
8.1
Summary
The use of GPUs for cryptographic processing was investigated to determine whether GPUs can be used to offload processing from the CPU and whether GPU-based encryption and decryption can assist in protecting data on untrusted clients. GPUs provide a significant amount of parallel processing compared to any existing multi-CPU configuration. Data can be stored in pixels that are processed in parallel. While the programmability of GPUs has been increasing, GPUs are not designed to be general purpose processors and the algorithms that can be implemented in them are limited. The addition of a programmable pixel processor and larger supported pixel sizes increases the potential for using GPUs as general purpose processors, but common capabilities of CPUs are still missing. This is partially due to the APIs for GPUs and partially due to hardware limitations. The implementation of AES demonstrates that GPU-based encryption and decryption is possible with a symmetric key cipher. However, public key ciphers and some symmetric key ciphers involve data types and/or operations that cannot be programmed within existing GPUs. Other ciphers that can be programmed within a GPU require multiple steps to perform some basic operations, such as shifts. The prototype of the remotely keyed GPU demonstrates the concept of decrypting images and displays within a GPU as a means of combating spyware. It is applicable to scenarios where untrusted clients are used to access remote desktops or to participate in video conferences. The primary insight is that a suitably modified GPU can serve as a minimally trusted computing base for displays in these applications. The main mechanism in the scheme is decryption of frames exclusively inside the GPU, without storing either the key material
or the plaintext in the system's main memory. Graphical keypads can be used as an alternative method for keying the GPU. The following enhancements to GPUs and/or APIs are needed to easily program existing cryptographic operations within GPUs:
• Support for modular arithmetic for use in both asymmetric and symmetric key ciphers.
• Support for a data type of unsigned integer for use in symmetric key ciphers.
• Support for byte-level operations including rotations and shifts across single bytes, individual color components of pixels and entire pixels for use in symmetric key ciphers.
• Support for branching in the pixel processor for use in symmetric key ciphers.
• Support for using the value of a pixel and the value of a color component of a pixel as an argument in operations, which is required for the conditional statements found in most stream ciphers.
• Support for large integers to allow asymmetric key ciphers to be executed in the GPU.
Support for additional operations, including branching and byte-level operations, is feasible. Support for new data types, especially large integers, is less likely. Aside from the above capabilities for programming the cryptographic algorithms in a GPU, the following new GPU capabilities are needed to fully realize the architecture presented in Chapter 5. The second item is not required when using the keypad method to convey a secret key to the GPU instead of a proxy and remote keying protocol.
• An easy mechanism for blocking malware on a system from reading unencrypted data from a GPU is required for the architecture to be useful for protecting displays on untrusted systems. This can be accomplished with a capability for a process to temporarily disable the GPU from responding to any write or read command issued by another process (i.e., disabling the ability to read data from the GPU for all but one process).
• If a protocol involving the use of a public key for the GPU is used to establish the secret key, a defined location for storing a public key or certificate with additional information in the GPU would be useful instead of loading the public key into memory the CPU uses for general operations. Dedicated storage will be needed for the credentials required by the TCG architecture and a certificate can be included.
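The request for byte-level rotations in the first list above can be made concrete with a small CPU-side sketch. Where no rotate instruction is available, a rotation of an 8-bit value can be emulated with a 256-entry lookup table, which is the same mechanism the OpenGL color maps provide in the AES implementation. The function names build_rotl_table and apply_table are hypothetical, and the code only illustrates the idea; it is not part of the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill table so that table[v] is v rotated left by n bits (n taken mod 8). */
void build_rotl_table(uint8_t table[256], unsigned n)
{
    assert(table != NULL);
    n &= 7u;
    for (unsigned v = 0; v < 256; ++v)
        table[v] = (uint8_t)((v << n) | (v >> ((8u - n) & 7u)));
}

/* Replace every byte by its table entry, as a color map does for each
 * color component during a pixel copy. */
void apply_table(uint8_t *bytes, size_t len, const uint8_t table[256])
{
    for (size_t i = 0; i < len; ++i)
        bytes[i] = table[bytes[i]];
}
```

A table per rotation amount costs 256 bytes and one extra pass over the data, which is the kind of multi-step workaround the summary refers to; native rotate support would remove both.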
8.2
Suggested Projects
Several extensions to the experiments described in Chapters 4 and 5 are possible. These may serve as exercises for students. The first set of exercises below builds upon the OpenGL version of AES. These will familiarize students with programming encryption in a GPU.
1 When AES was implemented in OpenGL, there was no support for 64-bit and 128-bit pixels. Using a graphics card that supports 64-bit or 128-bit pixels, modify the implementation to use 64-bit or 128-bit pixels to process more blocks of data simultaneously compared to what was achieved using 32-bit pixels.
2 An implementation that processes identical blocks of data, each with a different key, can be created. Given a plaintext, ciphertext pair and a partial key, use the GPU to perform an exhaustive search on the remainder of the key by encrypting the plaintext (or decrypting the ciphertext) with the possible keys and then checking the resulting ciphertext (or plaintext) against the known value. Compare the time it takes to find the remaining key bits to an exhaustive search using the CPU. Determine a reasonable number of keys to test in parallel based on the GPU being used and the display size, and then set the number of known key bits accordingly.
3 Implement the modes of encryption described in Chapter 4 to run with AES in the GPU. The OpenGL version of AES was run in ECB mode. The other modes can easily be implemented in OpenGL.
4 Implement AES's key schedule in OpenGL.
5 Implement AES's decryption function in OpenGL. The code for the encryption function can be modified and the data layout shown in Figure 4.6 in Chapter 4 used. The tables for encryption provided in the appendix will have to be replaced with the tables needed for decryption. Decrypt the test value from FIPS 197 to verify the code is working. The test value is included in the encryption code in Appendix A.
The following exercises involve experimenting with ideas described in previous chapters.
1 It may be feasible to design a symmetric key cipher (most likely a stream cipher) for GPUs. Experiment with graphic operations to produce a key stream and use it to encrypt images in a GPU. Test the randomness of the key stream bits. Note: http://csrc.nist.gov/rng/ includes descriptions of tests for detecting non-randomness in binary sequences. Determine if the implementation works on different GPUs by encrypting the image on one GPU and decrypting it on another or by computing the key stream in
different GPUs, reading the pixels to the system memory and comparing the resulting bytes. Recall that rounding within a GPU may impact the exact pixel values produced. Therefore, an algorithm may not be portable amongst different GPUs unless steps are taken to avoid error due to rounding.
2 Implement a version of the method for keying the GPU described in Chapter 6 involving a user selecting from colored squares on the display.
3 The concept of encryption within devices on PCs can be extended to encrypt audio in programmable DSPs. Using a programmable DSP, implement a symmetric key cipher within the DSP and demonstrate the encryption and decryption of audio within the DSP.
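For the first exercise in this second set, a starting point for testing key stream bits and for comparing key streams read back from two GPUs might look like the sketch below. It implements only the simplest check of that kind, a frequency (monobit) test, under the assumption that a sequence passes when its p-value is at least 0.01; the constant 6.635 is the square of 2.5758, the point at which that p-value is reached, so no math library is needed. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frequency (monobit) test: count +1 for each 1 bit and -1 for each 0 bit.
 * With n bits and sum s, the sequence passes at the 1% level when
 * s*s <= 6.635 * n. Returns 1 on pass, 0 on fail. Meaningful only for
 * reasonably long sequences (at least about 100 bits). */
int monobit_pass(const uint8_t *bytes, size_t len)
{
    assert(len > 0);
    long long s = 0;
    for (size_t i = 0; i < len; ++i)
        for (int b = 0; b < 8; ++b)
            s += ((bytes[i] >> b) & 1) ? 1 : -1;
    double n = 8.0 * (double)len;
    return (double)s * (double)s <= 6.635 * n;
}

/* Count the bytes that differ between two key streams, for example the
 * pixels read back from two different GPUs. */
size_t count_mismatches(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        if (a[i] != b[i])
            ++diff;
    return diff;
}
```

Passing the monobit test says only that ones and zeros are roughly balanced; a perfectly periodic sequence also passes, so further tests are needed before drawing any conclusion about the key stream.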
Appendix A AES OpenGL Code for Encryption
A.1
Overview
This appendix contains code for an OpenGL version of the AES encryption function. The code encrypts identical copies of a 16-byte data block. The block of data and expanded key are predefined and written to the GPU. The pixel format used is 32 bits per pixel with 8 bits per color component. Two versions of the program are provided. The first version uses the red pixel component and the back buffer. It performs the operations in the back buffer then displays the final result to the front buffer. The second version uses the red, green and blue pixel components and the front buffer. It performs the operations in the front buffer, allowing the user to see the pixels being updated.
A.2
Version Using the Red Pixel Component and the Back Buffer
/*
 * THIS SOFTWARE IS PROVIDED BY THE AUTHORS "AS IS" AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
 * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * Example AES implementation using OpenGL and the red pixel component.
 * All work is performed in the back buffer. The key expansion is not
 * performed in the GPU. In this sample code, the data to be encrypted and
 * the expanded key are predefined. They are taken from FIPS 197.
 *
 * Press 'e' to trigger encryption. One block of the resulting ciphertext
 * will be printed to the window from which the program is executed to
 * allow the user to verify the data was correctly encrypted.
 *
 * This code runs with Microsoft Visual C++ and requires OpenGL and GLUT.
 *
 * The data block from FIPS 197 and its corresponding ciphertext:
 * data:       32 88 31 e0 43 5a 31 37 f6 30 98 07 a8 8d a2 34
 * ciphertext: 39 02 dc 19 25 dc 11 6a 84 09 85 0b 1d fb 97 32
 *
 * The expanded key is defined below in the array ekey.
 */
#include <stdlib.h>
#include <GL/glut.h> /* header name lost in the listing; GLUT's header is assumed (it also pulls in gl.h and glu.h) */
#include <stdio.h>

/* # of blocks encrypted simultaneously in one pixel component */
#define NBLKS 500
/* the expanded key is loaded starting at pixel (KEY_START_POS,0) */
#define KEY_START_POS 16
/* 16 bytes in 128 bit block */
#define BYTES_PER_BLK 16
/* 176 bytes in expanded key */
#define EKEY_BYTES 176

/* contains data that will be encrypted */
GLubyte data[BYTES_PER_BLK*NBLKS];
/* contains ciphertext */
GLubyte out_data[BYTES_PER_BLK*NBLKS];

/* expanded key */
GLubyte ekey[EKEY_BYTES] = {
  /* initial whitening */
  0x2b,0x28,0xab,0x09,0x7e,0xae,0xf7,0xcf,
  0x15,0xd2,0x15,0x4f,0x16,0xa6,0x88,0x3c,
  /* 1st round key */
  0xa0,0x88,0x23,0x2a,0xfa,0x54,0xa3,0x6c,
  0xfe,0x2c,0x39,0x76,0x17,0xb1,0x39,0x05,
  /* 2nd round key */
  0xf2,0x7a,0x59,0x73,0xc2,0x96,0x35,0x59,
  0x95,0xb9,0x80,0xf6,0xf2,0x43,0x7a,0x7f,
  /* 3rd round key */
  0x3d,0x47,0x1e,0x6d,0x80,0x16,0x23,0x7a,
  0x47,0xfe,0x7e,0x88,0x7d,0x3e,0x44,0x3b,
  /* 4th round key */
  0xef,0xa8,0xb6,0xdb,0x44,0x52,0x71,0x0b,
  0xa5,0x5b,0x25,0xad,0x41,0x7f,0x3b,0x00,
  /* 5th round key */
  0xd4,0x7c,0xca,0x11,0xd1,0x83,0xf2,0xf9,
  0xc6,0x9d,0xb8,0x15,0xf8,0x87,0xbc,0xbc,
  /* 6th round key */
  0x6d,0x11,0xdb,0xca,0x88,0x0b,0xf9,0x00,
  0xa3,0x3e,0x86,0x93,0x7a,0xfd,0x41,0xfd,
  /* 7th round key */
  0x4e,0x5f,0x84,0x4e,0x54,0x5f,0xa6,0xa6,
  0xf7,0xc9,0x4f,0xdc,0x0e,0xf3,0xb2,0x4f,
  /* 8th round key */
  0xea,0xb5,0x31,0x7f,0xd2,0x8d,0x2b,0x8d,
  0x73,0xba,0xf5,0x29,0x21,0xd2,0x60,0x2f,
  /* 9th round key */
  0xac,0x19,0x28,0x57,0x77,0xfa,0xd1,0x5c,
  0x66,0xdc,0x29,0x00,0xf3,0x21,0x41,0x6e,
  /* 10th round key */
  0xd0,0xc9,0xe1,0xb6,0x14,0xee,0x3f,0x63,
  0xf9,0x25,0x0c,0x0c,0xa8,0x89,0xc8,0xa6,
};

/* The T tables are written as 3 tables (1*Sbox, 2*Sbox, 3*Sbox) in order to
   process data in 1 byte segments as a single pixel color component instead
   of processing 4 bytes. Values are converted to floating point by dividing
   by 255 then adding 0.000001. The addition of 0.000001 is because OpenGL
   stores the pixels as floating point values and truncates the values when
   converting from floating point to integer format. This conversion in
   format occurs when using a color component as an index into the color
   map. Therefore, each value needs to be >= the corresponding integer but
   less than the next integer to avoid errors due to rounding. 0, 1 are set
   to exactly 0 and 1.
*/
const GLfloat Te1[256] = {
  0.388237,0.486276,0.466668,0.482354,0.949021,0.419609,0.435295,0.772550,
  0.188236,0.003922,0.403923,0.168629,0.996080,0.843139,0.670590,0.462746,
  0.792158,0.509805,0.788236,0.490197,0.980394,0.349020,0.278433,0.941177,
  0.678432,0.831374,0.635296,0.686276,0.611766,0.643138,0.447059,0.752942,
  0.717648,0.992158,0.576472,0.142039,0.211766,0.247060,0.968629,0.800001,
  0.203923,0.647060,0.898040,0.945099,0.443138,0.847060,0.192158,0.082354,
  0.015687,0.780393,0.137256,0.764707,0.094119,0.588237,0.019609,0.603923,
  0.027452,0.070589,0.501962,0.886276,0.921570,0.152942,0.698041,0.458824,
  0.035295,0.513727,0.172550,0.101961,0.105884,0.431373,0.352942,0.627452,
  0.321569,0.231374,0.839217,0.701962,0.160785,0.890197,0.184315,0.517648,
  0.325491,0.819609,0.000000,0.929413,0.125491,0.988236,0.694118,0.356864,
  0.415687,0.796080,0.745100,0.223530,0.290197,0.298040,0.345099,0.811766,
  0.815687,0.937256,0.666668,0.984315,0.262746,0.301962,0.200001,0.521569,
  0.270588,0.976471,0.007844,0.498040,0.313727,0.235295,0.623531,0.658825,
  0.317648,0.639217,0.250980,0.560786,0.572551,0.615687,0.219609,0.960785,
  0.737256,0.713727,0.854903,0.129413,0.062746,1.000000,0.952942,0.823531,
  0.803922,0.047060,0.074511,0.925491,0.372550,0.592158,0.266668,0.090197,
  0.768628,0.654903,0.494118,0.239216,0.392158,0.364707,0.098040,0.450982,
  0.376472,0.505883,0.309805,0.862746,0.133334,0.164706,0.564707,0.533334,
  0.274511,0.933335,0.721570,0.078432,0.870590,0.368629,0.043139,0.858825,
  0.878432,0.196079,0.227451,0.039216,0.286275,0.023531,0.141177,0.360785,
  0.760786,0.827452,0.674511,0.384314,0.568628,0.584315,0.894119,0.474511,
  0.905884,0.784315,0.215688,0.427452,0.552942,0.835295,0.305884,0.662746,
  0.423530,0.337256,0.956864,0.917649,0.396079,0.478432,0.682354,0.031374,
  0.729413,0.470589,0.145099,0.180394,0.109805,0.650982,0.705883,0.776472,
  0.909805,0.866667,0.454903,0.121570,0.294119,0.741177,0.545099,0.541178,
  0.439217,0.243139,0.709805,0.400001,0.282354,0.011766,0.964707,0.054903,
  0.380393,0.207844,0.341178,0.725491,0.525492,0.756864,0.113726,0.619609,
  0.882354,0.972550,0.596079,0.066667,0.411765,0.850981,0.556864,0.580393,
  0.607844,0.117649,0.529413,0.913726,0.807845,0.333334,0.156863,0.874511,
  0.549021,0.631373,0.537256,0.050981,0.749021,0.901962,0.258824,0.407844,
  0.254903,0.600001,0.176471,0.058825,0.690197,0.329413,0.733335,0.086276
};

const GLfloat Te2[256] = {
  0.776472,0.972550,0.933335,0.964707,1.000000,0.839217,0.870590,0.568628,
  0.376472,0.007844,0.807845,0.337256,0.905884,0.709805,0.301962,0.925491,
  0.560786,0.121570,0.537256,0.980394,0.937256,0.698041,0.556864,0.984315,
  0.254903,0.701962,0.372550,0.270588,0.137256,0.325491,0.894119,0.607844,
  0.458824,0.882354,0.239216,0.298040,0.423530,0.494118,0.960785,0.513727,
  0.407844,0.317648,0.819609,0.976471,0.886276,0.670590,0.384314,0.164706,
  0.031374,0.584315,0.274511,0.615687,0.188236,0.215688,0.039216,0.184315,
  0.054903,0.141177,0.105884,0.874511,0.803922,0.305884,0.498040,0.917649,
  0.070589,0.113726,0.345099,0.203923,0.211766,0.862746,0.705883,0.356864,
  0.643138,0.462746,0.717648,0.490197,0.321569,0.866667,0.368629,0.074511,
  0.650982,0.725491,0.000000,0.756864,0.250980,0.890197,0.474511,0.713727,
  0.831374,0.552942,0.403923,0.447059,0.580393,0.596079,0.690197,0.521569,
  0.733335,0.772550,0.309805,0.929413,0.525492,0.603923,0.400001,0.066667,
  0.541178,0.913726,0.015687,0.996080,0.627452,0.470589,0.145099,0.294119,
  0.635296,0.364707,0.501962,0.019609,0.247060,0.129413,0.439217,0.945099,
  0.388237,0.466668,0.686276,0.258824,0.125491,0.898040,0.992158,0.749021,
  0.505883,0.094119,0.149020,0.764707,0.745100,0.207844,0.533334,0.180394,
  0.576472,0.333334,0.988236,0.478432,0.784315,0.729413,0.196079,0.901962,
  0.752942,0.098040,0.619609,0.639217,0.266668,0.329413,0.231374,0.043139,
  0.549021,0.780393,0.419609,0.152157,0.654903,0.737256,0.086276,0.678432,
  0.858825,0.392158,0.454903,0.078432,0.572551,0.047060,0.282354,0.721570,
  0.623531,0.741177,0.262746,0.768628,0.223530,0.192158,0.827452,0.949021,
  0.835295,0.545099,0.431373,0.854903,0.003922,0.694118,0.611766,0.286275,
  0.847060,0.674511,0.952942,0.811766,0.792158,0.956864,0.278433,0.062746,
  0.435295,0.941177,0.290197,0.360785,0.219609,0.341178,0.450982,0.592158,
  0.796080,0.631373,0.909805,0.243139,0.588237,0.380393,0.050981,0.058825,
  0.878432,0.486276,0.443138,0.800001,0.564707,0.023531,0.968629,0.109805,
  0.760786,0.415687,0.682354,0.411765,0.090197,0.600001,0.227451,0.152942,
  0.850981,0.921570,0.168629,0.133334,0.823531,0.662746,0.027452,0.200001,
  0.176471,0.235295,0.082354,0.788236,0.529413,0.666668,0.313727,0.647060,
  0.011766,0.349020,0.035295,0.101961,0.396079,0.843139,0.517648,0.815687,
  0.509805,0.160785,0.352942,0.117649,0.482354,0.658825,0.427452,0.172550
};

const GLfloat Te3[256] = {
  0.647060,0.517648,0.600001,0.552942,0.050981,0.741177,0.694118,0.329413,
  0.313727,0.011766,0.662746,0.490197,0.098040,0.384314,0.901962,0.603923,
  0.270588,0.615687,0.250980,0.529413,0.082354,0.921570,0.788236,0.043139,
  0.925491,0.403923,0.992158,0.917649,0.749021,0.968629,0.588237,0.356864,
  0.760786,0.109805,0.682354,0.415687,0.352942,0.254903,0.007844,0.309805,
  0.360785,0.956864,0.203923,0.031374,0.576472,0.450982,0.325491,0.247060,
  0.047060,0.321569,0.396079,0.368629,0.152157,0.631373,0.058825,0.709805,
  0.035295,0.211766,0.607844,0.239216,0.142039,0.411765,0.803922,0.623531,
  0.105884,0.619609,0.454903,0.180394,0.176471,0.698041,0.933335,0.984315,
  0.964707,0.301962,0.380393,0.807845,0.482354,0.243139,0.443138,0.592158,
  0.960785,0.407844,0.000000,0.172550,0.376472,0.121570,0.784315,0.929413,
  0.745100,0.274511,0.850981,0.294119,0.870590,0.831374,0.909805,0.290197,
  0.419609,0.164706,0.898040,0.086276,0.772550,0.843139,0.333334,0.580393,
  0.811766,0.062746,0.023531,0.505883,0.941177,0.266668,0.729413,0.890197,
  0.952942,0.996080,0.752942,0.541178,0.678432,0.737256,0.282354,0.015687,
  0.874511,0.756864,0.458824,0.388237,0.188236,0.101961,0.054903,0.427452,
  0.298040,0.078432,0.207844,0.184315,0.882354,0.635296,0.800001,0.223530,
  0.341178,0.949021,0.509805,0.278433,0.674511,0.905884,0.168629,0.584315,
  0.627452,0.596079,0.819609,0.498040,0.400001,0.494118,0.670590,0.513727,
  0.792158,0.160785,0.827452,0.235295,0.474511,0.886276,0.113726,0.462746,
  0.231374,0.337256,0.305884,0.117649,0.858825,0.039216,0.423530,0.894119,
  0.364707,0.431373,0.937256,0.650982,0.658825,0.643138,0.215688,0.545099,
  0.196079,0.262746,0.349020,0.717648,0.549021,0.392158,0.823531,0.878432,
  0.705883,0.980394,0.027452,0.145099,0.686276,0.556864,0.913726,0.094119,
  0.835295,0.533334,0.435295,0.447059,0.141177,0.945099,0.780393,0.317648,
  0.137256,0.486276,0.611766,0.129413,0.866667,0.862746,0.525492,0.521569,
  0.564707,0.258824,0.768628,0.666668,0.847060,0.019609,0.003922,0.070589,
  0.639217,0.372550,0.976471,0.815687,0.568628,0.345099,0.152942,0.725491,
  0.219609,0.074511,0.701962,0.200001,0.733335,0.439217,0.537256,0.654903,
  0.713727,0.133334,0.572551,0.125491,0.286275,1.000000,0.470589,0.478432,
  0.560786,0.972550,0.501962,0.090197,0.854903,0.192158,0.776472,0.721570,
  0.764707,0.690197,0.466668,0.066667,0.796080,0.988236,0.839217,0.227451
};

/* creates NBLKS copies of 16 byte test data */
void maketestdata(void) {
  int i=0;
  int cnt=0;
  for (cnt=0; cnt < NBLKS; ++cnt) {
    i = 16*cnt;
    data[i+0] = (GLubyte) 0x32;
    data[i+1] = (GLubyte) 0x88;
    data[i+2] = (GLubyte) 0x31;
    data[i+3] = (GLubyte) 0xe0;
    data[i+4] = (GLubyte) 0x43;
    data[i+5] = (GLubyte) 0x5a;
    data[i+6] = (GLubyte) 0x31;
    data[i+7] = (GLubyte) 0x37;
    data[i+8] = (GLubyte) 0xf6;
    data[i+9] = (GLubyte) 0x30;
    data[i+10] = (GLubyte) 0x98;
    data[i+11] = (GLubyte) 0x07;
    data[i+12] = (GLubyte) 0xa8;
    data[i+13] = (GLubyte) 0x8d;
    data[i+14] = (GLubyte) 0xa2;
    data[i+15] = (GLubyte) 0x34;
  } // end of for i
} /* end of maketestdata */

/* helper function for encryption */
void add_layer(int dx1,int dy1,int sx1,int sy1,int w1,int h1,
               int dx2,int dy2,int sx2,int sy2,int w2,int h2) {
  glRasterPos2i(dx1,dy1);
  glCopyPixels(sx1,sy1,w1,h1,GL_COLOR);
  glRasterPos2i(dx2,dy2);
  glCopyPixels(sx2,sy2,w2,h2,GL_COLOR);
}

/* encryption */
void encrypt(void) {
  int r = 0;
  int k;
  int key_ind = KEY_START_POS;
  int num_rnds = 9;
  int cnt = 0; /* index used in print statements */

  /* disable logical operations and color maps when reading in data and key */
  glDisable(GL_COLOR_LOGIC_OP);
  glPixelTransferi(GL_MAP_COLOR,0);

  /* load expanded key at (KEY_START_POS,0) in RED pixel component
     NBLKS copies (rows) of expanded key are needed */
  for (k = 0; k < NBLKS; ++k) {
    glRasterPos2i(KEY_START_POS,k);
    glDrawPixels(EKEY_BYTES,1,GL_RED,GL_UNSIGNED_BYTE,ekey);
  } // end of for k

  /* load data at (0,0) into RED pixel component */
  glRasterPos2i(0,0);
  glDrawPixels(BYTES_PER_BLK,NBLKS,GL_RED,GL_UNSIGNED_BYTE,data);

  /* perform first xor with key */
  glEnable(GL_COLOR_LOGIC_OP);
  glLogicOp(GL_XOR);
  glRasterPos2i(0,0);
  glCopyPixels(KEY_START_POS,0,16,NBLKS,GL_COLOR);
  glDisable(GL_COLOR_LOGIC_OP);

  /* start of round */
  /* compute 1*,2*,3* Sbox of each byte */
  num_rnds = 9;
  for (r = 0; r < num_rnds; ++r) {
    glPixelTransferi(GL_MAP_COLOR,1);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
    glRasterPos2i(192,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLKS,GL_COLOR);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te2);
    glRasterPos2i(208,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLKS,GL_COLOR);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te3);
    glRasterPos2i(224,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLKS,GL_COLOR);
    glPixelTransferi(GL_MAP_COLOR,0); /* turn mapping off */

    /* 1st term of XOR */
    /* CopyPixels create rows 1,2,3,4 in order corresponding to
       2*,1*,1*,3* S-Box entries respectively */
    glRasterPos2i(0,0);
    glCopyPixels(208,0,4,NBLKS,GL_COLOR);
    glRasterPos2i(4,0);
    glCopyPixels(192,0,4,NBLKS,GL_COLOR);
    glRasterPos2i(8,0);
    glCopyPixels(192,0,4,NBLKS,GL_COLOR);
    glRasterPos2i(12,0);
    glCopyPixels(224,0,4,NBLKS,GL_COLOR);

    /* turn xor on */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);

    /* 2nd term of XOR */
    /* creates rows 1,2,3,4 in order corresponding to
       3*,2*,1*,1* S-Box entries respectively */
    add_layer(0,0,229,0,3,NBLKS,3,0,228,0,1,NBLKS);
    add_layer(4,0,213,0,3,NBLKS,7,0,212,0,1,NBLKS);
    add_layer(8,0,197,0,3,NBLKS,11,0,196,0,1,NBLKS);
    add_layer(12,0,197,0,3,NBLKS,15,0,196,0,1,NBLKS);

    /* 3rd term of XOR */
    /* creates rows 1,2,3,4 in order corresponding to
       1*,3*,2*,1* S-Box entries respectively */
    add_layer(0,0,202,0,2,NBLKS,2,0,200,0,2,NBLKS);
    add_layer(4,0,234,0,2,NBLKS,6,0,232,0,2,NBLKS);
    add_layer(8,0,218,0,2,NBLKS,10,0,216,0,2,NBLKS);
    add_layer(12,0,202,0,2,NBLKS,14,0,200,0,2,NBLKS);

    /* 4th term of XOR */
    /* creates rows 1,2,3,4 in order corresponding to
       1*,1*,3*,2* S-Box entries respectively */
    add_layer(0,0,207,0,1,NBLKS,1,0,204,0,3,NBLKS);
    add_layer(4,0,207,0,1,NBLKS,5,0,204,0,3,NBLKS);
    add_layer(8,0,239,0,1,NBLKS,9,0,236,0,3,NBLKS);
    add_layer(12,0,223,0,1,NBLKS,13,0,220,0,3,NBLKS);

    /* xor with round key */
    key_ind = key_ind + 16;
    glRasterPos2i(0,0);
    glCopyPixels(key_ind,0,16,NBLKS,GL_COLOR);

    /* turn XOR off before starting next round */
    glDisable(GL_COLOR_LOGIC_OP);
  } /* end for r */

  /* last round Sbox, ShiftRows and XOR with round key */
  glDisable(GL_COLOR_LOGIC_OP);

  /* SBox */
  glPixelTransferi(GL_MAP_COLOR,1);
  glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
  glRasterPos2i(192,0); /* destination of copy */
  glCopyPixels(0,0,16,NBLKS,GL_COLOR);

  /* ShiftRows */
  glPixelTransferi(GL_MAP_COLOR,0);
  glRasterPos2i(0,0);
  glCopyPixels(192,0,4,NBLKS,GL_COLOR);
  add_layer(4,0,197,0,3,NBLKS,7,0,196,0,1,NBLKS);
  add_layer(8,0,202,0,2,NBLKS,10,0,200,0,2,NBLKS);
  add_layer(12,0,207,0,1,NBLKS,13,0,204,0,3,NBLKS);

  /* xor with round key */
  glEnable(GL_COLOR_LOGIC_OP);
  glLogicOp(GL_XOR);
  key_ind = key_ind + 16;
  glRasterPos2i(0,0);
  glCopyPixels(key_ind,0,16,NBLKS,GL_COLOR);

  /* read buffer to system memory */
  glReadPixels(0,0,BYTES_PER_BLK,NBLKS,GL_RED,GL_UNSIGNED_BYTE,out_data);

  /* print 1 line of results */
  for (cnt=0; cnt < 16; ++cnt) {
    printf("%X ",out_data[cnt]);
  }
  printf("\n");
} /* end of encrypt */

void init(void) {
  /* dithering needs to be off to prevent pixels from being averaged
     with neighbors and set all pixels to 0 */
  glDisable(GL_DITHER);
  glClearColor(0.0,0.0,0.0,0.0);
  glClearDepth(1.0);

  /* to simplify indexing: set raster positions to correspond to pixels,
     0,0 = lower left */
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  gluOrtho2D(0.0,300.0, 0.0, 510.0);
  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();

  /* set data transfers from/to system to use back buffer */
  glDrawBuffer(GL_BACK);
  glReadBuffer(GL_BACK);

  /* create the test data */
  maketestdata();

  /* set alignment for data storage */
  glPixelStorei(GL_UNPACK_ALIGNMENT,1);
} /* end of init */

/* display just clears the buffer */
void display(void) {
  glClear(GL_COLOR_BUFFER_BIT);
  glFlush();
} /* end of display */

/* pressing "e" will run the encryption function */
void Key(unsigned char pressedkey,int x, int y) {
  switch(pressedkey) {
    case 'e':
      encrypt();
      glFlush();
      break;
    default:
      break;
  }
}
/* end of Key */

int main(int argc, char **argv) {
  const GLubyte *ver_str;
  glutInit(&argc, argv);
  glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
  glutInitWindowSize(300,510);
  glutInitWindowPosition(50,10);
  glutCreateWindow("aes");
  glutKeyboardFunc(Key);
  init();
  /* print OpenGL version used */
  ver_str = glGetString(GL_VERSION);
  fprintf(stderr,"version %s \n",(const char *)ver_str);
  fprintf(stderr,"Press e to encrypt.\n");
  glutDisplayFunc(display);
  glutMainLoop();
  return 0;
}
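The recipe given in the comment above the T tables (divide by 255, add 0.000001, keep 0 and 1 exact) can be sketched in code as follows. This is a reconstruction from that comment, not the authors' generator: the S-box is computed from its definition in FIPS 197 instead of being stored, all function names are hypothetical, and a few printed table entries differ from this recipe in their last digits, so the output should be compared by the byte it truncates to and not digit for digit.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) using the AES reduction polynomial. */
uint8_t gf_xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* General multiplication in GF(2^8). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = gf_xtime(a);
        b >>= 1;
    }
    return p;
}

/* AES S-box: multiplicative inverse (a^254, which maps 0 to 0)
 * followed by the affine transformation. */
uint8_t aes_sbox(uint8_t a)
{
    uint8_t inv = 1;
    for (int i = 0; i < 254; ++i)
        inv = gf_mul(inv, a);
    uint8_t s = inv;
    for (int r = 1; r <= 4; ++r)
        s ^= (uint8_t)((inv << r) | (inv >> (8 - r)));
    return (uint8_t)(s ^ 0x63);
}

/* Byte to color component: v/255 nudged up by 0.000001 so that
 * truncation back to an integer yields v; 0 and 255 map to 0 and 1. */
float to_component(uint8_t v)
{
    if (v == 0)
        return 0.0f;
    if (v == 255)
        return 1.0f;
    return (float)v / 255.0f + 0.000001f;
}

/* Fill the three tables with 1*Sbox, 2*Sbox and 3*Sbox. */
void make_te_tables(float te1[256], float te2[256], float te3[256])
{
    assert(te1 && te2 && te3);
    for (int i = 0; i < 256; ++i) {
        uint8_t s = aes_sbox((uint8_t)i);
        uint8_t s2 = gf_xtime(s);
        te1[i] = to_component(s);
        te2[i] = to_component(s2);
        te3[i] = to_component((uint8_t)(s2 ^ s));
    }
}
```

Generating the tables at start-up instead of typing them in removes a source of transcription errors, at the cost of a few thousand field multiplications performed once.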
A.3
Version Using the RGB Pixel Components and the Front Buffer
/*
 * THIS SOFTWARE IS PROVIDED BY THE AUTHORS "AS IS" AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
 * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
 * ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * Example AES implementation using OpenGL and the RGB pixel components.
 * All work is performed in the front buffer. The key expansion is not
 * performed in the GPU. In this sample code the data to be encrypted and
 * the expanded key are predefined. They are taken from FIPS 197.
 *
 * For each color component, one block of the resulting ciphertext will be
 * printed to the window from which the program is executed to allow the
 * user to verify the data was correctly encrypted.
 *
 * This code runs with Microsoft Visual C++ and requires OpenGL and GLUT.
 *
 * The data block from FIPS 197 and its corresponding ciphertext:
 * data:       32 88 31 e0 43 5a 31 37 f6 30 98 07 a8 8d a2 34
 * ciphertext: 39 02 dc 19 25 dc 11 6a 84 09 85 0b 1d fb 97 32
 *
 * The expanded key is defined below in the array ekey.
 */
#include <stdlib.h>
#include <GL/glut.h> /* header name lost in the listing; GLUT's header is assumed (it also pulls in gl.h and glu.h) */
#include <stdio.h>

/* number of blocks encrypted simultaneously in one pixel component
   total # of blocks encrypted simultaneously is 3*NBLK */
#define NBLK 300
/* the expanded key is loaded starting at pixel (KEY_START_POS,0) */
#define KEY_START_POS 16
#define BYTES_PER_BLK 16 /* 16 bytes in 128 bit block */
#define EKEY_BYTES 176   /* 176 bytes in expanded key */

/* temp array used to create multiple data copies */
GLubyte tmpdata[BYTES_PER_BLK*NBLK];
/* data[i][j] contains input bytes and is read into
   the jth component of the pixels */
GLubyte data[BYTES_PER_BLK*NBLK][3];
/* expanded key */
GLubyte ekey[EKEY_BYTES];
/* expanded key, 1 copy in each of RGB */
GLubyte rgba_ekey[EKEY_BYTES][3];
/* 16 bytes of output per block */
GLubyte out_data[BYTES_PER_BLK*NBLK][3];
/* contains one data block of output, used to verify the ciphertext */
GLubyte out_red[BYTES_PER_BLK];
GLubyte out_green[BYTES_PER_BLK];
GLubyte out_blue[BYTES_PER_BLK];

/* The T tables are written as 3 tables (1*Sbox, 2*Sbox, 3*Sbox) in order to
   process data in 1 byte segments as a single pixel
   color component instead of processing 4 bytes. Values are converted to
   floating point by dividing by 255 then adding 0.000001. The addition of
   0.000001 is because OpenGL stores the pixels as floating point values and
   truncates the values when converting from floating point to integer
   format. This conversion in format occurs when using a color component as
   an index into the color map. Therefore, each value needs to be >= the
   corresponding integer but less than the next integer to avoid errors due
   to rounding. 0, 1 are set to exactly 0 and 1. */
static const GLfloat Te1[256] = {
  0.388237,0.486276,0.466668,0.482354,0.949021,0.419609,0.435295,0.772550,
  0.188236,0.003922,0.403923,0.168629,0.996080,0.843139,0.670590,0.462746,
  0.792158,0.509805,0.788236,0.490197,0.980394,0.349020,0.278433,0.941177,
  0.678432,0.831374,0.635296,0.686276,0.611766,0.643138,0.447059,0.752942,
  0.717648,0.992158,0.576472,0.142039,0.211766,0.247060,0.968629,0.800001,
  0.203923,0.647060,0.898040,0.945099,0.443138,0.847060,0.192158,0.082354,
  0.015687,0.780393,0.137256,0.764707,0.094119,0.588237,0.019609,0.603923,
  0.027452,0.070589,0.501962,0.886276,0.921570,0.152942,0.698041,0.458824,
  0.035295,0.513727,0.172550,0.101961,0.105884,0.431373,0.352942,0.627452,
  0.321569,0.231374,0.839217,0.701962,0.160785,0.890197,0.184315,0.517648,
  0.325491,0.819609,0.000000,0.929413,0.125491,0.988236,0.694118,0.356864,
  0.415687,0.796080,0.745100,0.223530,0.290197,0.298040,0.345099,0.811766,
  0.815687,0.937256,0.666668,0.984315,0.262746,0.301962,0.200001,0.521569,
  0.270588,0.976471,0.007844,0.498040,0.313727,0.235295,0.623531,0.658825,
  0.317648,0.639217,0.250980,0.560786,0.572551,0.615687,0.219609,0.960785,
  0.737256,0.713727,0.854903,0.129413,0.062746,1.000000,0.952942,0.823531,
  0.803922,0.047060,0.074511,0.925491,0.372550,0.592158,0.266668,0.090197,
  0.768628,0.654903,0.494118,0.239216,0.392158,0.364707,0.098040,0.450982,
  0.376472,0.505883,0.309805,0.862746,0.133334,0.164706,0.564707,0.533334,
  0.274511,0.933335,0.721570,0.078432,0.870590,0.368629,0.043139,0.858825,
  0.878432,0.196079,0.227451,0.039216,0.286275,0.023531,0.141177,0.360785,
  0.760786,0.827452,0.674511,0.384314,0.568628,0.584315,0.894119,0.474511,
  0.905884,0.784315,0.215688,0.427452,0.552942,0.835295,0.305884,0.662746,
  0.423530,0.337256,0.956864,0.917649,0.396079,0.478432,0.682354,0.031374,
  0.729413,0.470589,0.145099,0.180394,0.109805,0.650982,0.705883,0.776472,
  0.909805,0.866667,0.454903,0.121570,0.294119,0.741177,0.545099,0.541178,
  0.439217,0.243139,0.709805,0.400001,0.282354,0.011766,0.964707,0.054903,
  0.380393,0.207844,0.341178,0.725491,0.525492,0.756864,0.113726,0.619609,
  0.882354,0.972550,0.596079,0.066667,0.411765,0.850981,0.556864,0.580393,
  0.607844,0.117649,0.529413,0.913726,0.807845,0.333334,0.156863,0.874511,
  0.549021,0.631373,0.537256,0.050981,0.749021,0.901962,0.258824,0.407844,
  0.254903,0.600001,0.176471,0.058825,0.690197,0.329413,0.733335,0.086276
};

static const GLfloat Te2[256] = {
  0.776472,0.972550,0.933335,0.964707,1.000000,0.839217,0.870590,0.568628,
  0.376472,0.007844,0.807845,0.337256,0.905884,0.709805,0.301962,0.925491,
  0.560786,0.121570,0.537256,0.980394,0.937256,0.698041,0.556864,0.984315,
  0.254903,0.701962,0.372550,0.270588,0.137256,0.325491,0.894119,0.607844,
  0.458824,0.882354,0.239216,0.298040,0.423530,0.494118,0.960785,0.513727,
  0.407844,0.317648,0.819609,0.976471,0.886276,0.670590,0.384314,0.164706,
  0.031374,0.584315,0.274511,0.615687,0.188236,0.215688,0.039216,0.184315,
  0.054903,0.141177,0.105884,0.874511,0.803922,0.305884,0.498040,0.917649,
  0.070589,0.113726,0.345099,0.203923,0.211766,0.862746,0.705883,0.356864,
  0.643138,0.462746,0.717648,0.490197,0.321569,0.866667,0.368629,0.074511,
  0.650982,0.725491,0.000000,0.756864,0.250980,0.890197,0.474511,0.713727,
  0.831374,0.552942,0.403923,0.447059,0.580393,0.596079,0.690197,0.521569,
  0.733335,0.772550,0.309805,0.929413,0.525492,0.603923,0.400001,0.066667,
  0.541178,0.913726,0.015687,0.996080,0.627452,0.470589,0.145099,0.294119,
  0.635296,0.364707,0.501962,0.019609,0.247060,0.129413,0.439217,0.945099,
  0.388237,0.466668,0.686276,0.258824,0.125491,0.898040,0.992158,0.749021,
  0.505883,0.094119,0.149020,0.764707,0.745100,0.207844,0.533334,0.180394,
  0.576472,0.333334,0.988236,0.478432,0.784315,0.729413,0.196079,0.901962,
  0.752942,0.098040,0.619609,0.639217,0.266668,0.329413,0.231374,0.043139,
  0.549021,0.780393,0.419609,0.152157,0.654903,0.737256,0.086276,0.678432,
  0.858825,0.392158,0.454903,0.078432,0.572551,0.047060,0.282354,0.721570,
  0.623531,0.741177,0.262746,0.768628,0.223530,0.192158,0.827452,0.949021,
  0.835295,0.545099,0.431373,0.854903,0.003922,0.694118,0.611766,0.286275,
  0.847060,0.674511,0.952942,0.811766,0.792158,0.956864,0.278433,0.062746,
  0.435295,0.941177,0.290197,0.360785,0.219609,0.341178,0.450982,0.592158,
  0.796080,0.631373,0.909805,0.243139,0.588237,0.380393,0.050981,0.058825,
  0.878432,0.486276,0.443138,0.800001,0.564707,0.023531,0.968629,0.109805,
  0.760786,0.415687,0.682354,0.411765,0.090197,0.600001,0.227451,0.152942,
  0.850981,0.921570,0.168629,0.133334,0.823531,0.662746,0.027452,0.200001,
  0.176471,0.235295,0.082354,0.788236,0.529413,0.666668,0.313727,0.647060,
  0.011766,0.349020,0.035295,0.101961,0.396079,0.843139,0.517648,0.815687,
  0.509805,0.160785,0.352942,0.117649,0.482354,0.658825,0.427452,0.172550
};

static const GLfloat Te3[256] = {
  0.647060,0.517648,0.600001,0.552942,0.050981,0.741177,0.694118,0.329413,
  0.313727,0.011766,0.662746,0.490197,0.098040,0.384314,0.901962,0.603923,
  0.270588,0.615687,0.250980,0.529413,0.082354,0.921570,0.788236,0.043139,
  0.925491,0.403923,0.992158,0.917649,0.749021,0.968629,0.588237,0.356864,
  0.760786,0.109805,0.682354,0.415687,0.352942,0.254903,0.007844,0.309805,
  0.360785,0.956864,0.203923,0.031374,0.576472,0.450982,0.325491,0.247060,
  0.047060,0.321569,0.396079,0.368629,0.152157,0.631373,0.058825,0.709805,
  0.035295,0.211766,0.607844,0.239216,0.142039,0.411765,0.803922,0.623531,
  0.105884,0.619609,0.454903,0.180394,0.176471,0.698041,0.933335,0.984315,
  0.964707,0.301962,0.380393,0.807845,0.482354,0.243139,0.443138,0.592158,
  0.960785,0.407844,0.000000,0.172550,0.376472,0.121570,0.784315,0.929413,
  0.745100,0.274511,0.850981,0.294119,0.870590,0.831374,0.909805,0.290197,
  0.419609,0.164706,0.898040,0.086276,0.772550,0.843139,0.333334,0.580393,
  0.811766,0.062746,0.023531,0.505883,0.941177,0.266668,0.729413,0.890197,
  0.952942,0.996080,0.752942,0.541178,0.678432,0.737256,0.282354,0.015687,
  0.874511,0.756864,0.458824,0.388237,0.188236,0.101961,0.054903,0.427452,
  0.298040,0.078432,0.207844,0.184315,0.882354,0.635296,0.800001,0.223530,
  0.341178,0.949021,0.509805,0.278433,0.674511,0.905884,0.168629,0.584315,
  0.627452,0.596079,0.819609,0.498040,0.400001,0.494118,0.670590,0.513727,
  0.792158,0.160785,0.827452,0.235295,0.474511,0.886276,0.113726,0.462746,
  0.231374,0.337256,0.305884,0.117649,0.858825,0.039216,0.423530,0.894119,
  0.364707,0.431373,0.937256,0.650982,0.658825,0.643138,0.215688,0.545099,
  0.196079,0.262746,0.349020,0.717648,0.549021,0.392158,0.823531,0.878432,
  0.705883,0.980394,0.027452,0.145099,0.686276,0.556864,0.913726,0.094119,
  0.835295,0.533334,0.435295,0.447059,0.141177,0.945099,0.780393,0.317648,
  0.137256,0.486276,0.611766,0.129413,0.866667,0.862746,0.525492,0.521569,
  0.564707,0.258824,0.768628,0.666668,0.847060,0.019609,0.003922,0.070589,
  0.639217,0.372550,0.976471,0.815687,0.568628,0.345099,0.152942,0.725491,
  0.219609,0.074511,0.701962,0.200001,0.733335,0.439217,0.537256,0.654903,
  0.713727,0.133334,0.572551,0.125491,0.286275,1.000000,0.470589,0.478432,
  0.560786,0.972550,0.501962,0.090197,0.854903,0.192158,0.776472,0.721570,
  0.764707,0.690197,0.466668,0.066667,0.796080,0.988236,0.839217,0.227451
};

/* creates NBLK copies of 16 byte test data */
void maketestdata(void)
{
  int i=0;
  int cnt,j;

  for (cnt=0; cnt < NBLK; ++cnt) {
    i = 16*cnt;
    for (j=0; j < 3; ++j) {
      data[i+0][j] = (GLubyte) 0x32;
      data[i+1][j] = (GLubyte) 0x88;
      data[i+2][j] = (GLubyte) 0x31;
      data[i+3][j] = (GLubyte) 0xe0;
      data[i+4][j] = (GLubyte) 0x43;
      data[i+5][j] = (GLubyte) 0x5a;
      data[i+6][j] = (GLubyte) 0x31;
      data[i+7][j] = (GLubyte) 0x37;
      data[i+8][j] = (GLubyte) 0xf6;
      data[i+9][j] = (GLubyte) 0x30;
      data[i+10][j] = (GLubyte) 0x98;
      data[i+11][j] = (GLubyte) 0x07;
      data[i+12][j] = (GLubyte) 0xa8;
      data[i+13][j] = (GLubyte) 0x8d;
      data[i+14][j] = (GLubyte) 0xa2;
      data[i+15][j] = (GLubyte) 0x34;
    }
  } // end of for cnt
} /* end of maketestdata */

/* expanded key written 1 entry per line for readability */
void maketestekey(void)
{
  int i,j;

  /* initial whitening */
  ekey[0] = (GLubyte) 0x2b;
  ekey[1] = (GLubyte) 0x28;
  ekey[2] = (GLubyte) 0xab;
  ekey[3] = (GLubyte) 0x09;
  ekey[4] = (GLubyte) 0x7e;
  ekey[5] = (GLubyte) 0xae;
  ekey[6] = (GLubyte) 0xf7;
  ekey[7] = (GLubyte) 0xcf;
  ekey[8] = (GLubyte) 0x15;
  ekey[9] = (GLubyte) 0xd2;
  ekey[10] = (GLubyte) 0x15;
  ekey[11] = (GLubyte) 0x4f;
  ekey[12] = (GLubyte) 0x16;
  ekey[13] = (GLubyte) 0xa6;
  ekey[14] = (GLubyte) 0x88;
  ekey[15] = (GLubyte) 0x3c;

  /* 1st round key */
  ekey[16] = (GLubyte) 0xa0;
  ekey[17] = (GLubyte) 0x88;
  ekey[18] = (GLubyte) 0x23;
  ekey[19] = (GLubyte) 0x2a;
  ekey[20] = (GLubyte) 0xfa;
  ekey[21] = (GLubyte) 0x54;
  ekey[22] = (GLubyte) 0xa3;
  ekey[23] = (GLubyte) 0x6c;
  ekey[24] = (GLubyte) 0xfe;
  ekey[25] = (GLubyte) 0x2c;
  ekey[26] = (GLubyte) 0x39;
  ekey[27] = (GLubyte) 0x76;
  ekey[28] = (GLubyte) 0x17;
  ekey[29] = (GLubyte) 0xb1;
  ekey[30] = (GLubyte) 0x39;
  ekey[31] = (GLubyte) 0x05;

  /* 2nd round key */
  ekey[32] = (GLubyte) 0xf2;
  ekey[33] = (GLubyte) 0x7a;
  ekey[34] = (GLubyte) 0x59;
  ekey[35] = (GLubyte) 0x73;
  ekey[36] = (GLubyte) 0xc2;
  ekey[37] = (GLubyte) 0x96;
  ekey[38] = (GLubyte) 0x35;
  ekey[39] = (GLubyte) 0x59;
  ekey[40] = (GLubyte) 0x95;
  ekey[41] = (GLubyte) 0xb9;
  ekey[42] = (GLubyte) 0x80;
  ekey[43] = (GLubyte) 0xf6;
  ekey[44] = (GLubyte) 0xf2;
  ekey[45] = (GLubyte) 0x43;
  ekey[46] = (GLubyte) 0x7a;
  ekey[47] = (GLubyte) 0x7f;

  /* 3rd round key */
  ekey[48] = (GLubyte) 0x3d;
  ekey[49] = (GLubyte) 0x47;
  ekey[50] = (GLubyte) 0x1e;
  ekey[51] = (GLubyte) 0x6d;
  ekey[52] = (GLubyte) 0x80;
  ekey[53] = (GLubyte) 0x16;
  ekey[54] = (GLubyte) 0x23;
  ekey[55] = (GLubyte) 0x7a;
  ekey[56] = (GLubyte) 0x47;
  ekey[57] = (GLubyte) 0xfe;
  ekey[58] = (GLubyte) 0x7e;
  ekey[59] = (GLubyte) 0x88;
  ekey[60] = (GLubyte) 0x7d;
  ekey[61] = (GLubyte) 0x3e;
  ekey[62] = (GLubyte) 0x44;
  ekey[63] = (GLubyte) 0x3b;

  /* 4th round key */
  ekey[64] = (GLubyte) 0xef;
  ekey[65] = (GLubyte) 0xa8;
  ekey[66] = (GLubyte) 0xb6;
  ekey[67] = (GLubyte) 0xdb;
  ekey[68] = (GLubyte) 0x44;
  ekey[69] = (GLubyte) 0x52;
  ekey[70] = (GLubyte) 0x71;
  ekey[71] = (GLubyte) 0x0b;
  ekey[72] = (GLubyte) 0xa5;
  ekey[73] = (GLubyte) 0x5b;
  ekey[74] = (GLubyte) 0x25;
  ekey[75] = (GLubyte) 0xad;
  ekey[76] = (GLubyte) 0x41;
  ekey[77] = (GLubyte) 0x7f;
  ekey[78] = (GLubyte) 0x3b;
  ekey[79] = (GLubyte) 0x00;

  /* 5th round key */
  ekey[80] = (GLubyte) 0xd4;
  ekey[81] = (GLubyte) 0x7c;
  ekey[82] = (GLubyte) 0xca;
  ekey[83] = (GLubyte) 0x11;
  ekey[84] = (GLubyte) 0xd1;
  ekey[85] = (GLubyte) 0x83;
  ekey[86] = (GLubyte) 0xf2;
  ekey[87] = (GLubyte) 0xf9;
  ekey[88] = (GLubyte) 0xc6;
  ekey[89] = (GLubyte) 0x9d;
  ekey[90] = (GLubyte) 0xb8;
  ekey[91] = (GLubyte) 0x15;
  ekey[92] = (GLubyte) 0xf8;
  ekey[93] = (GLubyte) 0x87;
  ekey[94] = (GLubyte) 0xbc;
  ekey[95] = (GLubyte) 0xbc;

  /* 6th round key */
  ekey[96] = (GLubyte) 0x6d;
  ekey[97] = (GLubyte) 0x11;
  ekey[98] = (GLubyte) 0xdb;
  ekey[99] = (GLubyte) 0xca;
  ekey[100] = (GLubyte) 0x88;
  ekey[101] = (GLubyte) 0x0b;
  ekey[102] = (GLubyte) 0xf9;
  ekey[103] = (GLubyte) 0x00;
  ekey[104] = (GLubyte) 0xa3;
  ekey[105] = (GLubyte) 0x3e;
  ekey[106] = (GLubyte) 0x86;
  ekey[107] = (GLubyte) 0x93;
  ekey[108] = (GLubyte) 0x7a;
  ekey[109] = (GLubyte) 0xfd;
  ekey[110] = (GLubyte) 0x41;
  ekey[111] = (GLubyte) 0xfd;

  /* 7th round key */
  ekey[112] = (GLubyte) 0x4e;
  ekey[113] = (GLubyte) 0x5f;
  ekey[114] = (GLubyte) 0x84;
  ekey[115] = (GLubyte) 0x4e;
  ekey[116] = (GLubyte) 0x54;
  ekey[117] = (GLubyte) 0x5f;
  ekey[118] = (GLubyte) 0xa6;
  ekey[119] = (GLubyte) 0xa6;
  ekey[120] = (GLubyte) 0xf7;
  ekey[121] = (GLubyte) 0xc9;
  ekey[122] = (GLubyte) 0x4f;
  ekey[123] = (GLubyte) 0xdc;
  ekey[124] = (GLubyte) 0x0e;
  ekey[125] = (GLubyte) 0xf3;
  ekey[126] = (GLubyte) 0xb2;
  ekey[127] = (GLubyte) 0x4f;

  /* 8th round key */
  ekey[128] = (GLubyte) 0xea;
  ekey[129] = (GLubyte) 0xb5;
  ekey[130] = (GLubyte) 0x31;
  ekey[131] = (GLubyte) 0x7f;
  ekey[132] = (GLubyte) 0xd2;
  ekey[133] = (GLubyte) 0x8d;
  ekey[134] = (GLubyte) 0x2b;
  ekey[135] = (GLubyte) 0x8d;
  ekey[136] = (GLubyte) 0x73;
  ekey[137] = (GLubyte) 0xba;
  ekey[138] = (GLubyte) 0xf5;
  ekey[139] = (GLubyte) 0x29;
  ekey[140] = (GLubyte) 0x21;
  ekey[141] = (GLubyte) 0xd2;
  ekey[142] = (GLubyte) 0x60;
  ekey[143] = (GLubyte) 0x2f;

  /* 9th round key */
  ekey[144] = (GLubyte) 0xac;
  ekey[145] = (GLubyte) 0x19;
  ekey[146] = (GLubyte) 0x28;
  ekey[147] = (GLubyte) 0x57;
  ekey[148] = (GLubyte) 0x77;
  ekey[149] = (GLubyte) 0xfa;
  ekey[150] = (GLubyte) 0xd1;
  ekey[151] = (GLubyte) 0x5c;
  ekey[152] = (GLubyte) 0x66;
  ekey[153] = (GLubyte) 0xdc;
  ekey[154] = (GLubyte) 0x29;
  ekey[155] = (GLubyte) 0x00;
  ekey[156] = (GLubyte) 0xf3;
  ekey[157] = (GLubyte) 0x21;
  ekey[158] = (GLubyte) 0x41;
  ekey[159] = (GLubyte) 0x6e;

  /* 10th round key */
  ekey[160] = (GLubyte) 0xd0;
  ekey[161] = (GLubyte) 0xc9;
  ekey[162] = (GLubyte) 0xe1;
  ekey[163] = (GLubyte) 0xb6;
  ekey[164] = (GLubyte) 0x14;
  ekey[165] = (GLubyte) 0xee;
  ekey[166] = (GLubyte) 0x3f;
  ekey[167] = (GLubyte) 0x63;
  ekey[168] = (GLubyte) 0xf9;
  ekey[169] = (GLubyte) 0x25;
  ekey[170] = (GLubyte) 0x0c;
  ekey[171] = (GLubyte) 0x0c;
  ekey[172] = (GLubyte) 0xa8;
  ekey[173] = (GLubyte) 0x89;
  ekey[174] = (GLubyte) 0xc8;
  ekey[175] = (GLubyte) 0xa6;

  for (i=0; i < 176; ++i) {
    for (j=0; j < 3; ++j) {
      rgba_ekey[i][j] = ekey[i];
    }
  }
} /* end of maketestekey */

/* helper function - performs 2 copies */
void add_layer(int dx1,int dy1,int sx1,int sy1,int w1,int h1,
               int dx2,int dy2,int sx2,int sy2,int w2,int h2)
{
  glRasterPos2i(dx1,dy1);
  glCopyPixels(sx1,sy1,w1,h1,GL_COLOR);
  glRasterPos2i(dx2,dy2);
  glCopyPixels(sx2,sy2,w2,h2,GL_COLOR);
}

/* encryption function */
void encrypt(void)
{
  int r = 0;
  int ri = 0;
  int k;
  int key_ind = KEY_START_POS;
  int num_rnds = 9;
  int cnt=0; /* index used in print statements */

  glDisable(GL_COLOR_LOGIC_OP);
  glPixelTransferi(GL_MAP_COLOR,0);

  /* load expanded key at (KEY_START_POS,0)
     NBLK copies (rows) of expanded key are needed */
  for (k = 0; k < NBLK; ++k) {
    glRasterPos2i(KEY_START_POS,k);
    glDrawPixels(EKEY_BYTES,1,GL_RGB,GL_UNSIGNED_BYTE,rgba_ekey);
  } // end of for k

  /* load data at (0,0) */
  glRasterPos2i(0,0);
  glDrawPixels(BYTES_PER_BLK,NBLK,GL_RGB,GL_UNSIGNED_BYTE,data);

  /* perform first xor with key */
  glEnable(GL_COLOR_LOGIC_OP);
  glLogicOp(GL_XOR);
  glRasterPos2i(0,0);
  glCopyPixels(KEY_START_POS,0,16,NBLK,GL_COLOR);
  glDisable(GL_COLOR_LOGIC_OP);

  /* start of round */
  /* compute 1*,2*,3* Sbox of each byte */
  for (r = 0; r < 9; ++r) {
    glPixelTransferi(GL_MAP_COLOR,1);
    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
    glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te1);
    glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te1);
    glRasterPos2i(192,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLK,GL_COLOR);

    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te2);
    glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te2);
    glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te2);
    glRasterPos2i(208,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLK,GL_COLOR);

    glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te3);
    glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te3);
    glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te3);
    glRasterPos2i(224,0); /* destination of copy */
    glCopyPixels(0,0,16,NBLK,GL_COLOR);

    glPixelTransferi(GL_MAP_COLOR,0); /* turn mapping off */

    /* create "T0[row1]" */
    /* 1st of 4 layers of 1st row 2* entry */
    glRasterPos2i(0,0);
    glCopyPixels(208,0,4,NBLK,GL_COLOR);
    /* 1st of 4 layers of 2nd row 1* entry */
    glRasterPos2i(4,0);
    glCopyPixels(192,0,4,NBLK,GL_COLOR);
    /* 1st of 4 layers of 3rd row 1* entry */
    glRasterPos2i(8,0);
    glCopyPixels(192,0,4,NBLK,GL_COLOR);
    /* 1st of 4 layers of 4th row 3* entry */
    glRasterPos2i(12,0);
    glCopyPixels(224,0,4,NBLK,GL_COLOR);

    /* turn xor on */
    glEnable(GL_COLOR_LOGIC_OP);
    glLogicOp(GL_XOR);

    /* create "T1[row2]" */
    /* 2nd of 4 layers of 1st row 3* entry */
    add_layer(0,0,229,0,3,NBLK,3,0,228,0,1,NBLK);
    /* 2nd of 4 layers of 2nd row 2* entry */
    add_layer(4,0,213,0,3,NBLK,7,0,212,0,1,NBLK);
    /* 2nd of 4 layers of 3rd row 1* entry */
    add_layer(8,0,197,0,3,NBLK,11,0,196,0,1,NBLK);
    /* 2nd of 4 layers of 4th row 1* entry */
    add_layer(12,0,197,0,3,NBLK,15,0,196,0,1,NBLK);

    /* create "T2[row3]" */
    /* 3rd of 4 layers of 1st row 1* entry */
    add_layer(0,0,202,0,2,NBLK,2,0,200,0,2,NBLK);
    /* 3rd of 4 layers of 2nd row 3* entry */
    add_layer(4,0,234,0,2,NBLK,6,0,232,0,2,NBLK);
    /* 3rd of 4 layers of 3rd row 2* entry */
    add_layer(8,0,218,0,2,NBLK,10,0,216,0,2,NBLK);
    /* 3rd of 4 layers of 4th row 1* entry */
    add_layer(12,0,202,0,2,NBLK,14,0,200,0,2,NBLK);

    /* create "T3[row4]" */
    /* 4th of 4 layers of 1st row 1* entry */
    add_layer(0,0,207,0,1,NBLK,1,0,204,0,3,NBLK);
    /* 4th of 4 layers of 2nd row 1* entry */
    add_layer(4,0,207,0,1,NBLK,5,0,204,0,3,NBLK);
    /* 4th of 4 layers of 3rd row 3* entry */
    add_layer(8,0,239,0,1,NBLK,9,0,236,0,3,NBLK);
    /* 4th of 4 layers of 4th row 2* entry */
    add_layer(12,0,223,0,1,NBLK,13,0,220,0,3,NBLK);

    /* xor with round key */
    key_ind = key_ind + 16;
    glRasterPos2i(0,0);
    glCopyPixels(key_ind,0,16,NBLK,GL_COLOR);

    /* turn off XOR before starting the next round */
    glDisable(GL_COLOR_LOGIC_OP);
  } /* end of for r */

  /* last round Sbox, ShiftRows and XOR with round key */
  glDisable(GL_COLOR_LOGIC_OP);

  /* SBox */
  glPixelTransferi(GL_MAP_COLOR,1);
  glPixelMapfv(GL_PIXEL_MAP_R_TO_R,256,Te1);
  glPixelMapfv(GL_PIXEL_MAP_G_TO_G,256,Te1);
  glPixelMapfv(GL_PIXEL_MAP_B_TO_B,256,Te1);
  glRasterPos2i(192,0); /* destination of copy */
  glCopyPixels(0,0,16,NBLK,GL_COLOR);

  /* ShiftRows */
  glPixelTransferi(GL_MAP_COLOR,0);
  glRasterPos2i(0,0);
  glCopyPixels(192,0,4,NBLK,GL_COLOR);
  add_layer(4,0,197,0,3,NBLK,7,0,196,0,1,NBLK);
  add_layer(8,0,202,0,2,NBLK,10,0,200,0,2,NBLK);
  add_layer(12,0,207,0,1,NBLK,13,0,204,0,3,NBLK);

  /* xor with round key */
  glEnable(GL_COLOR_LOGIC_OP);
  glLogicOp(GL_XOR);
  key_ind = key_ind + 16;
  glRasterPos2i(0,0);
  glCopyPixels(key_ind,0,16,NBLK,GL_COLOR);

  /* read buffer to system memory */
  // glReadPixels(0,0,BYTES_PER_BLK,NBLK,GL_RGB,GL_UNSIGNED_BYTE,out_data);

  /* Uncomment the above line to read all pixels to a single array
     which can then be written to a file. The following prints one row
     (since all blocks being encrypted are identical in this example,
     just check one row) of each pixel component so the user can verify
     the ciphertext. */

  /* 1 line of each pixel color */
  glReadPixels(0,0,16,1,GL_RED,GL_UNSIGNED_BYTE,out_red);
  for (ri=0; ri < 16; ++ri) {
    printf("%X ", out_red[ri]);
  }
  printf("\n");

  glReadPixels(0,0,16,1,GL_GREEN,GL_UNSIGNED_BYTE,out_green);
  for (ri=0; ri < 16; ++ri) {
    printf("%X ", out_green[ri]);
  }
  printf("\n");

  glReadPixels(0,0,16,1,GL_BLUE,GL_UNSIGNED_BYTE,out_blue);
  for (ri=0; ri < 16; ++ri) {
    printf("%X ", out_blue[ri]);
  }
  printf("\n");
} /* end of encrypt */

void init(void)
{
  /* dithering needs to be off
     Initialize all pixels to 0 */
  glDisable(GL_DITHER);
  glClearColor(1.0,1.0,1.0,1.0);
  glClearDepth(1.0);

  /* to simplify indexing: set raster positions to correspond
     to pixels, 0,0 = lower left */
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  gluOrtho2D(0.0,300.0, 0.0, 410.0);
  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glDrawBuffer(GL_FRONT);
  glReadBuffer(GL_FRONT);
  maketestdata();
  maketestekey();
  glPixelStorei(GL_UNPACK_ALIGNMENT,1);
} /* end of init */

void display(void)
{
  glClear(GL_COLOR_BUFFER_BIT);
  encrypt();
  glFlush();
} /* end of display */

int main(int argc, char **argv)
{
  const GLubyte *ver_str;

  glutInit(&argc, argv);
  glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
  glutInitWindowSize(300,410);
  glutInitWindowPosition(50,10);
  glutCreateWindow("aes");
  init();
  ver_str = glGetString(GL_VERSION);
  fprintf(stderr, "OpenGL version %s \n", ver_str);
  glutDisplayFunc(display);
  glutMainLoop();
  return 0;
}
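The three lookup tables in the listing can be regenerated instead of being typed in. The sketch below is not part of the original listing. It assumes the construction described in the table comment: each entry is 1, 2 or 3 times the AES S-box value in GF(2^8), divided by 255, plus 0.000001, with 0 and 255 mapped to exactly 0 and 1. The names `gf_mul`, `aes_sbox`, `to_pixel`, `make_tables` and `tables_round_trip` are hypothetical, and the output is not guaranteed to reproduce every printed digit.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two bytes in GF(2^8) modulo the AES polynomial
   x^8 + x^4 + x^3 + x + 1. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* AES S-box: multiplicative inverse (found by search) followed by
   the affine transform. The inverse of 0 is taken to be 0. */
static unsigned char aes_sbox(unsigned char x)
{
    unsigned char inv = 0, s;
    int c, k;
    for (c = 1; c < 256 && x != 0; ++c) {
        if (gf_mul(x, (unsigned char)c) == 1) {
            inv = (unsigned char)c;
            break;
        }
    }
    s = inv;
    for (k = 1; k <= 4; ++k)   /* XOR in the four left rotations of inv */
        s ^= (unsigned char)((inv << k) | (inv >> (8 - k)));
    return (unsigned char)(s ^ 0x63);
}

/* Convert a byte to a pixel-map entry: v/255 plus a small offset so that
   truncation recovers v; 0 and 255 are stored exactly. */
static float to_pixel(unsigned char v)
{
    if (v == 0)
        return 0.0f;
    if (v == 255)
        return 1.0f;
    return (float)(v / 255.0 + 0.000001);
}

/* Fill the 1*Sbox, 2*Sbox and 3*Sbox tables (hypothetical helper). */
static void make_tables(float t1[256], float t2[256], float t3[256])
{
    int i;
    assert(t1 != NULL && t2 != NULL && t3 != NULL);
    for (i = 0; i < 256; ++i) {
        unsigned char s = aes_sbox((unsigned char)i);
        t1[i] = to_pixel(s);
        t2[i] = to_pixel(gf_mul(s, 2));
        t3[i] = to_pixel(gf_mul(s, 3));
    }
}

/* Return 1 if truncating entry*255 recovers the intended byte for
   every table entry, 0 otherwise. */
static int tables_round_trip(void)
{
    float t1[256], t2[256], t3[256];
    int i;
    make_tables(t1, t2, t3);
    for (i = 0; i < 256; ++i) {
        unsigned char s = aes_sbox((unsigned char)i);
        if ((int)(t1[i] * 255.0f) != s) return 0;
        if ((int)(t2[i] * 255.0f) != gf_mul(s, 2)) return 0;
        if ((int)(t3[i] * 255.0f) != gf_mul(s, 3)) return 0;
    }
    return 1;
}
```

Calling `make_tables` from a small `main` and printing the arrays eight per line gives tables in the same layout as the listing. `tables_round_trip` models the truncation the table comment describes; how a particular OpenGL implementation actually rounds is not checked here.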
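The final round of the listing performs ShiftRows with pairs of `glCopyPixels` calls issued through `add_layer`. As a rough CPU-side model, not taken from the original listing, the sketch below treats one 16-pixel row of the frame buffer as a byte array that holds the state row by row, and rotates state row r left by r positions using two `memcpy` calls. This mirrors how each `add_layer` call splits a row into two copies. The function name `shift_rows_by_copies` and the use of separate source and destination arrays are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* src holds the 4x4 state row by row (src[4*r + c]), the way the listing
   lays one block out across 16 pixels. dst receives the state after
   ShiftRows. The two buffers must not overlap. */
static void shift_rows_by_copies(const unsigned char src[16],
                                 unsigned char dst[16])
{
    int r;
    assert(src != dst);
    for (r = 0; r < 4; ++r) {
        /* first copy: columns r..3 of the source row move to the front */
        memcpy(dst + 4 * r, src + 4 * r + r, (size_t)(4 - r));
        /* second copy: the r leading columns wrap around to the end */
        memcpy(dst + 4 * r + (4 - r), src + 4 * r, (size_t)r);
    }
}
```

With `src` holding the bytes 0 through 15, `dst` comes out as 0 1 2 3, 5 6 7 4, 10 11 8 9, 15 12 13 14. For row 1 the two copies correspond to `add_layer(4,0,197,0,3,NBLK,7,0,196,0,1,NBLK)` in the listing, where the S-box output sits at x = 192.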
References
[1] W. A. Arbaugh. Chaining Layered Integrity Checks. PhD thesis, University of Pennsylvania, Philadelphia, 1999. [2] W. A. Arbaugh, D. J. Farber, and J. M. Smith. A secure and reliable bootstrap architecture. In IEEE Security and Privacy Conference, pages 65-71, May 1997. [3] P. Biddle, M. Peinado, and D. Flanagan. Privacy, Security and Content Protection. http://download.microsoft.eom/download/a/f/c/ afcf8195-0eda-4190-a46d-aa60b45e0740/Secure.ppt. 14] E. Biham. A Fast New DES Implementation in Software. In Workshop on Fast Software Encryption (FSE), pages 260-272, 1997. [5] E. Biham and A. Shamir. Differential Fault Analysis of Secret Key Cryptosystems. Computer Science Technical Report CS0910, Technion, 1997. [6] Boneh, Demillo, and Lipton. On the Importance of Checking Cryptgraphic Protocols for Faults. In Proceedings of Advances in Cryptology - Eurocrypt, pages 37-51, 1997. [7] D. Boneh and N. Shacham. Improving SSL Handshake Performance via Batching. In Proceedings of the RSA Conference, January 2001. [8] I. Buck. BrookGPU. i n d e x . h t m l , 2003.
http://graphics.stanford.edu/projects/brookgpu/
[9] J. Butler and S. Sparks. Spy ware and Rootkits - The Future Convergence. USENIX ;login:, 29(6):8-15, December 2004. [10] C.Elliot. Vertigo. h t t p : / / w w w . c o n a l . n e t / V e r t i g o . [11] A. Carroll, M. Juarez, J. Polk, and T. Leininger. Overview. White paper, Microsoft, August 2002.
Microsoft Palladium: A Business
[12] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell. Client-Side Defense Against WebBased Identity Theft. In Proceedings of the ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2004.
132
REFERENCES
[13] M. Christodorescu and S. Jha. Testing Malware Detectors. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), July 2004. [14] P. C. Clark. BITS: A Smartcard Protected Operating System. PhD thesis, George Washington University, 1994. [15] C. Coarfa, P. Druschel, and D. Wallach. Performance Analysis of TLS Web Servers. In Proceedings of the ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2002. [16] D. Cook, R. Baretto, and A. Keromytis. Remotely Keyed Cryptographies - Secure Remote Display Access Using (Mostly) Untrusted Hardware. In Proceedings of ICICS, pages 363-375, December 2005. [17] D. Cook, J. loannidis, A. Keromytis, and J. Luck. CryptoGraphics: Secret Key Cryptography Using Graphics Cards. In Proceedings of the RSA Conference, Cryptographer's Track (CT-RSA), pages 334-350, February 2005. [18] D. Coppersmith, et.al. The MARS Cipher, security/mars.html, 1999.
http://www.research.ibm.com/
[19] J. Daemon and V. Rijmen. The Design ofRijndael: AES the Advanced Encryption Standard. Springer-Verlag, Berlin, 2002. [20] D. Davis, F. Monrose, and M. K. Reiter. On User Choice in Graphical Password Schemes. In Proceedings of the 13*^ USENIX Security Symposium, pages 151-163, August 2004. [21] T. Dierks and C. Allen. The TLS protocol version 1.0. Request for Comments (Proposed Standard) 2246, Jan. 1999. [22] P. Druschel, M. Abbott, M. Pagels, and L. Peterson. Network subsystem design. IEEE Network, 7(4):8-17, July 1993. [23] P. Ekdahl and T. Johansson. A New Version of the Stream Cipher SNOW. In Proceedings of SAC, 2002. [24] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan. Security: Adding Protection to the Network via the Network Processor. Intel Technology Journal, 6, August 2002. [25] R. Fernando and M. Kilgard. The Cg Tutorial. Addison-Wesley, 2003. [26] N. Galoppo, N. Govindoraju, M. Henson, and D. Manocha. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware. In Proceedings of ACM/IEEE Super Computing Conference, 2005. [27] A. Goldberg, R. Buff, and A. Schmitt. Secure Web Server Performance Dramatically Improved By Caching SSL Session Keys. In Workshop on Internet Server Performance, held in conjunction with SIGMETRICS, June 1998. [28] V. Gupta, D. Stebila, S. Fung, S. C. Shantz, N. Gura, and H. Eberle. Speeding up Secure Web Transactions Using Elliptic Curve Cryptography. In Proceedings of the ISOC Symposium on Network and Distributed System Security (SNDSS), pages 231-239, February 2004.
REFERENCES
133
[29] P. Gutmann. The Design of a Cryptographic Security Architecture. In Proceedings of the 8*^ USENIX Security Symposium, August 1999. [30] P. Gutmann. An Open-source Cryptographic Coprocessor. In Proceedings of the 9*^ USENIX Security Symposium, August 2000. [31] H. Gobioff and S. Smith and J. Tygar and B. Yee. Smart Cards in Hostile Environments. In 2"^"^ USENIX Workshop on Electronic Commerce, 1996. [32] Helion Technology Limited. High Performance Solutions in Silicon, AES (Rijndael) Core, http://www.heliontech.com/core2.htm, 2003. [33] Y.-C. Hu, A. Perrig, and D. B. Johnson. Paclcet Leashes: A Defense against Wormhole Attacks in Wireless Networks. In Proceedings of IEEE Infocomm, April 2003. [34] N. L. P. Jr., T. Fraser, J. Molina, and W. A. Arbaugh. Copilot - a Coprocessor-based Kernel Runtime Integrity Monitor. In Proceedings of the 13*^ USENIX Security Symposium, pages 179-194, August 2004. [35] J. Kay and J. Pasquale. The Importance of Non-Data Touching Processing Overheads in TCP/IP. In Proceedings ACM SIGCOMM Conference, pages 259-269, September 1993. [36] J. Kelsey, B. Schneier, D. Wagner, and C. Hall. Side Channel Cryptanalysis of Product Ciphers. Journal of Computer Security, 8(2-3):141-158, 2000. [37] S. Kent and R. Atkinson. Security Architecture for the Internet Protocol. Request for Comments (Proposed Standard) 2401, Internet Engineering Task Force, Nov. 1998. [38] A. D. Keromytis, J. L. Wright, and T. de Raadt. The Design of the OpenBSD Cryptographic Framework. In Proceedings of the USENIX Annual Technical Conference, pages 181-196, June 2003. [39] J. Kessenich, D. Baldwin, and R. Rost. The OpenGL Shading Language Version 1.10. h t t p : //www. opengl. org, April 2004. [40] P. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and Other Systems. In Proceedings of Advances in Cryptology - Crypto, pages 104—113, 1996. [41] D. KoUer, M. Turitzin, M. Levoy, M. Tarini, G. Croccia, P. Cignoni, and R. Scopigno. 
Protected Interactive 3D Graphics Via Remote Rendering. In Proceedings of ACM SIGGRAPH, 2004. [42] H. Kuo and I. Verbauwhede. Architectual Optimization for 1.82 Gbits/sec VLSI Implementation of Rijndael Algorithm. In Proceedings ofCHES, pages 51-64, 2001. [43] X. Lai and J. Massey. A Proposal for a New Block Encryption Standard. In Proceedings ofEUROCRYPT1990, pages 389-404, 1991. [44] E. Levy. Interface Illusions. IEEE Security & Privacy, 2(6):66-69, November/December 2004. [45] A. Lutz, J. Treichler, F. Gurkeynak, H. Kaeslin, G. Bosler, A. Erni, S. Reichmuth, P. Rommens, S. Oetiker, and W. Fichtner. 2G bits/s Hardware Realizations of Rijndael and Serpent: A Comparative Analysis. In Proceedings ofCHES, pages 144-158, 2002.
134
REFERENCES
[46] M. Abadi and M. Burrows and C. Kaufman and B. Lampson. Authentication and Delegation with Smart-cards. In Theoretical Aspects of Computer Software, 1991. [47] M. Macedonia. The GPU Enters Computing's Mainstream. IEEE Computer Magazine, pages 106-108, October 2003. [48] J. McCune, J. Perrig, and M. Reiter. Bump in Ether: Mobile Phones as Proxies for Sensitive Input. Computer Science Technical Report CyLab-05-007, Carnigie Mellon University, 2005. [49] J. P. McGregor and R. B. Lee. Protecting Cryptographic Keys and Computations via Virtual Secure Coprocessing. In Proceedings of the Workshop on Architectural Support for Security and Anti-virus (WASSA), pages 11-21, October 2004. [50] M. McLoone and J. McConny. High Performance Single Chip FPGA Rijndael Algorithms Implementations. In Proceedings ofCHES, pages 65-76, 2001. [51] Microsoft. Microsoft DirectX, default.aspx.
http://www.microsoft.com/windows/directx/
[52] Microsoft. Windows 9 Media Series Digital Rights Management. microsoft.com/windows/windowsmedia/drm.aspx.
http://www.
[53] S. Miltchev, S. loannidis, and A. D. Keromytis. A Study of the Relative Costs of Network Security Protocols. In Proceedings of USENIX Annual Technical Conference, Freenix Track, pages 41-48, June 2002. [54] J. Nieh, S. J. Yang, and N. Novik. Measuring Thin-Client Performance Using SlowMotion Benchmarking. ACM Transactions on Computer Systems (TOCS), 21(1):87-115, Feb. 2003. [55] NIST. PIPS 46-3 Data Encryption Standard (DES), 1999. [56] NIST. PIPS 197 Advanced Encryption Standard (AES), 2001. [57] Nvidia. GPGPU Presentation, 2005. [58] OpenGL Organization. OpenGL. h t t p : //www. o p e n g l . org, 2005. [59] G. Organization. General Purpose Computation Using Graphics Hardware, h t t p : / / www.gpgpu.org. [60] D. Osvik, A. Shamir, and E. Tromer. Cache Attacks and Countermeasures: The Case of AES. In Proceedings ofRSA Conference Cryptographers Track (CT-RSA), 2006. [61] P. Rogaway. A Software Optimized Encryption Algorithm, pages 273-287, 1998. [62] M. Pharr, editor. GPU Gems2. Addison-Wesley, 2005. [63] C. Pu, H. Massalin, J. loannidis, and P. Metzger. The Synthesis System. Systems, 1(1), 1988.
Computing
[64] R. lannella. Digital Rights Management (DRM) Architectures. D-Lib Magazine, 1(6), June 2001.
REFERENCES
135
[65] V. Rijmen, A. Bosselaers, and P. Barreto. AES Optimized ANSI C Code, h t t p : //www. e s a t . k u l e u v e n . a c . b e / ~ r i j m e n / r i j n d a e l / r i j n d a e l - f s t - 3 . 0 . z i p , 2002. [66] Rivest, Robshaw, Sidney, and Yin. RC6 Block Cipher, http://www.rsasecurity. com/rsalabs/node.asp?id=2512, 1998. [67] R. Rivest. The RC5 Encryption Algorithm. CryptoBytes, 1(1), 1995. [68] G. Rose. A Stream Cipher Based on Linear Feedback Over GF (28). In Information Security and Privacy LNCS 1438, page 135ff, 1998. [69] V. Roth, K. Richter, and R. Freidinger. A PIN-Entry Method Resilient Against Shoulder Surfing. In Proceedings of the 11* ACM Conference on Computer and Communications Security (CCS), pages 236-245, October 2004. [70] RSA Laboratories. PKCS #7.- RSA Encryption Standard, Version 7.5, November 1993. [71] C. B. S. and J. M. Smith. Hardware/Software Organization of a High-Performance ATM Host Interface. IEEE Journal on Selected Areas in Communications (Special Issue on High Speed Computer/Network Interfaces), 11 (2):240-253, February 1993. [72] R. Sailer, X. Zhang, T Jaeger, and L. van Doom. Design and Implementation of a TCGbased Integrity Measurement Architecture. In Proceedings of the 13*^ USENIX Security Symposium, pages 223-238, August 2004. [73] S. Saroiu, S. D. Gribble, and H. M. Levy. Measurement and Analysis of Spyware in a University Environment. In Proceedings of the ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004. [74] B. K. Schmidt, M. S. Lam, and J. D. Northcutt. The Interactive Performance of SLIM: A Stateless, Thin-Client Architecture. In Proceedings of the 17*^ ACM Symposium on Operating Systems Principles (SOSP), pages 32-47, Kiawah Island Resort, SC, December 1999. [75] M. Segal and K. Akeley. The OpenGL Graphics System, A Specification, Version 2.0. h t t p : //www. opengl. org, SiliconGraphics, Inc., October 2004. [76] A. Shamir and E. Tromer. 
Acoustic Cryptanalysis On Nosy People and Noisy Machines. Eurocrypt rump session presentation, 2004. [77] M. Shirase and Y. Hibino. An architecture for elliptic curve cryptograph computation. In Proceedings of the Workshop on Architectural Support for Security and Anti-virus (WASSA), pages 120-129, October 2004. [78] Simpson, Dawson, Golic, and Millar. LILI Keystream Generator. In Selected Areas in Cryptology, LNCS 2012, page 248ff, 2000. [79] J. M. Smith and C. B. S. Traw. Giving Applications Access to Gb/s Networking. IEEE Network, 7(4):44-52, July 1993. [80] J. M. Smith, C. B. S. Traw, and D. J. Farber. Cryptographic Support for a Gigabit Network. In Proceedings oflNET, pages 229-237, June 1992. [81] S. Smith. Magic Boxes and Boots: Security in Hardware. IEEE Computer, 37(10): 106109, October 2004.
136
REFERENCES
[82] C. Thompson, S. Hahn, and M. Oskin. Using Modern Graphics Architectures for GeneralPurpose Computing: A Framework and Analysis. In 35*^ Annual IEEE/ACM International Symposium on Micro Architecture - MICRO-35, pages 306-317, 2002. [83] J. Thorpe and P. C. van Oorschot. Graphical Dictionaries and the Memorable Space of Graphical Passwords. ]In Proceedings of the 13*^ USENIX Security Symposium, pages 135-150, August 2004. [84] Trusted Computing Group. TCG Specification Architecture Overview, version 1.2. h t t p s : //\j\j\j. trustedcomputinggroup. org/home, April 2004. [85] J. Tygar and B. Yee. DYAD: A System for Using Physically Secure Coprocessors. Technical Report CMU-CS-91-140R, Carnegie Mellon University, May 1991. [86] Veritest. i-Bench version 1.5, Ziff-Davis, Inc, 2004. http://www.veritest.com/ benchmarks/i-bench/. [87] T. J. Walsh and D. R. Kuhn. Challenges in Securing Voice over IP. IEEE Security & Privacy Magazine, 3(3):44-49, May/June 2005. [88] S. Wasson. NVIDIA's GeForce 7800 GTX graphics processor. The Tech Report, h t t p : / / t e c h r e p o r t . com, June 2005. [89] M. Woo, J. Neider, T. Davis, and D. Shreiner. The OpenGL Programming Guide, S^'^ edition. Addison-Wesley, 1999. [90] Z. Ye, S. Smith, and D. Anthony. Trusted Paths for Browsers. ACM Transactions on Information and System Security (TISSEC), 8(2):153-186, May 2005. [91] B. Yee. Using Secure Coprocessors. PhD thesis, Carnegie Mellon University, 1994. [92] Q. Yu, C. Chen, and Z. Pan. Parallel Genetic Algorithms in Programmable Graphics Hardware. In Proceedings oflCNC, pages 1051-1059, 2005.
About the Authors
Debra Cook is a Ph.D. student with the Department of Computer Science at Columbia University in New York. She is completing her doctorate in 2006. Her research interests are focused in applied cryptography and security. She has a B.S. and M.S.E. in mathematical sciences from the Johns Hopkins University in Baltimore, Maryland and an M.S. in computer science from Columbia University. After graduating from Johns Hopkins, she was a senior technical staff member at Bell Labs and AT&T Labs before pursuing her Ph.D.

Angelos Keromytis is an Associate Professor of Computer Science at Columbia University in New York. His research interests include design and analysis of network and cryptographic protocols, software security and reliability, and operating system design. He received his Ph.D. and M.Sc. in computer science from the University of Pennsylvania, Philadelphia, PA in 2001. He received his B.S. in computer science from the University of Crete, Heraclion, Greece in 1996.
Index
AES, 27, 34, 39-41, 48-64, 105
    experiments, 58-64
    key schedule, 52
    OpenGL code, 107-129
    OpenGL implementation, 53-58
asymmetric key ciphers, 38
block ciphers, 24, 34, 40, 82, 99
BrookGPU, 17
Cg, 17
cryptographic accelerators, 25
data compression and CPUs, 97
DES, 34, 42
differential fault analysis, 33-35
Diffie-Hellman, 39
digital rights management, 29
digital signal processors, 101, 106
Direct3D, 17
elliptical curve cryptography, 39
GLUT, 18, 58, 62, 82
GLX, 62
GPU, 9-24
    APIs, 17
    architecture, 10
    pixel processor, 10
    vertex processor, 10
GPUs and general purpose programming, 15, 23
graphical keypad, 90, 106
graphics based stream cipher, 99, 105
keying of GPUs, 69, 90
    experiments, 82
    remote keying protocol, 75
MAC, 82
malware, 28, 30, 32, 42, 69, 90, 93
man-in-the-middle attack, 93
modes of encryption, 45-48, 105
OpenGL, 12, 16-22, 48, 78
phishing, 28, 94
pixel processing, 10, 12, 15, 19-22
projects, 105
RC4, 43, 79, 81
RC6, 42
remotely keyed CryptoGraphics, 69
RSA, 39, 80
side channel attacks, 33-35
spyware, 28, 30, 32, 37, 71, 87, 96, 97
stream ciphers, 40, 44
    experiments, 64-67
symmetric key ciphers, 40
thin-clients, 28, 69, 83
Trusted Computing Group, 29, 95
trusted platform module, 30, 95
untrusted clients, 69
user input - protecting, 89
vertex processing, 10, 13, 22
Vertigo, 17
video conferencing, 28, 69, 83
window toolkits - wrappers for APIs, 19