Itanium Architecture for Software Developers by Walter Triebel, Intel Press
Itanium Architecture for Software Developers Walter Triebel INTEL PRESS Copyright © 2000 Intel Corporation. All rights reserved. ISBN 0-9702846-4-0 Published by Intel Press in USA. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL CORPORATION MAY HAVE PATENTS OR PENDING PATENT APPLICATIONS, TRADEMARKS, COPYRIGHTS, OR OTHER INTELLECTUAL PROPERTY RIGHTS THAT RELATE TO THE PRESENTED SUBJECT MATTER. THE FURNISHING OF DOCUMENTS AND OTHER MATERIALS AND INFORMATION DOES NOT PROVIDE ANY LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY SUCH PATENTS, TRADEMARKS, COPYRIGHTS, OR OTHER INTELLECTUAL PROPERTY RIGHTS. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. Intel may make changes to specifications, product descriptions, and plans at any time,
without notice. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. Intel is a registered trademark of Intel Corporation. MMX, Pentium, Pentium II, Pentium III, i960, i860, Intel486, Pentium Pro, Celeron, and Itanium are trademarks of Intel Corporation. †All other names and brands are the property of their respective owners. This book is printed on acid-free paper.

Publisher: Rich Bowles (Intel)
Editor: David Spencer
Interior Design: Liah Rose
Composition: Interactive Composition Corporation
Cover: Mandish Designs, Inc.
Illustrations: Jerry Heideman and Associates, Interactive Composition Corporation
Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

Acknowledgments

Like most Intel® products, this book was a team effort. Without these contributors and reviewers, I couldn’t have produced the book. However, all errors are my own special province.

In particular, two key contributors helped me develop the text at crucial points. Sathya Prasad was my source for much of the initial text and sample code. His input was a powerful jumpstart. James Reinders provided the careful guidance that turned the first draft into the final book. He wrote the foreword and made original contributions to Chapters 1, 7, and 10.

In the tradition of scientific writing, I built on the foundation laid by other Intel documents. My primary content source for technical details was the IA-64 Application Developer’s Architecture Guide,[1] a comprehensive description of the Itanium architecture by the development team members at Intel and Hewlett-Packard. For allowing me to use their work, special thanks to authors Rumi Zahir (Intel), Allan Knies (Intel), Gautam Doshi (Intel), Drew Hess (Intel), Dale Morris (HP), Jim Hull (HP), Alejandro Quiroz (HP), Michael Morrell (HP), and to team leaders Hans Mulder (Intel) and Jerry Huck (HP).

Chapter 11, “Compiler Technology for Itanium™-based Software,” is based on two articles by members of the Intel compiler development team. For use of material from “An Overview of the Intel IA-64 Compiler,”[2] thanks to Carole Dulong and co-authors Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei Li, John Ng, and David Sehr. Additional information about profile-guided optimization came from two articles[3] by Jim Pierce, who also reviewed the chapter on behalf of the many compiler engineers in Intel’s Microcomputer Software Labs.

From course material created by Judy Lee and provided by Ray Asbury, training manager for Itanium architecture, I gained many helpful insights into the architecture and the assembly code. Bob Norin, at Intel’s Computational Software Lab, supplied source material for divide and square root computations, and sample code sequences for both.

Reviewers of this book influenced it from the outline to the final draft. Brian Nickerson, in IACG’s Tools and Application Engineering group, diligently reviewed every page of the final draft and greatly increased the technical accuracy. Geoff Murray of Intel’s Itanium processor development team also helped look for errors. Jason Waxman, Itanium processor launch manager, guided me to a more correct view of the product. James D. Howard, specialist in application porting and tuning at Intel’s ISV Performance Labs, counseled me on meeting the needs of software developers. Industry experts who offered advice and encouragement are Tom Cantwell at Dell Computer and Mike Kavanaugh at Oracle’s Itanium processor Program Lab.

[1] Originally published in May 1999 and now out of print.
[2] “Optimizing Your Code for IA-64,” published in the Intel Technology Journal, Q4 1999 and “Optimizing Your Code for IA-64” forthcoming in Microcomputer XXX.
[3] Published in the Intel Developer UPDATE Magazine, March 2000.
Foreword

Explicitly parallel instruction computing (EPIC) is a new architecture, a new fundamental basis for microprocessor designs from Intel. EPIC is the result of more studies at Intel and HP, more collaboration, and more engineering than any prior microprocessor. This book explains the result of that effort and why it matters. Without understanding the problems that face a modern-day microprocessor designer and compiler writer, one cannot imagine how engineers would dream up this architecture. Equally intriguing is the fact that Intel, after a couple of years of development, abandoned a much more obvious and seemingly viable approach to getting to market with a 64-bit microprocessor. Yes, the Itanium processor could have happened years ago, but without EPIC. It would not have been the Itanium architecture we know today. In a company that prides itself on engineering excellence and time-to-market, the conflicts are obvious. What was so compelling about EPIC that Intel chose to delay the move to 64 bits until a radical new architecture could be used? As you might expect, the technical reasons that would motivate a company like Intel to invest billions and delay the move to 64 bits for a few years are not easily told in even a few pages. Here’s the technical story behind the Itanium architecture.
HOW EPIC BECAME THE CHOICE FOR 64-BITS

Intel® microprocessors have been among the most popular microprocessors for a long time. By the early 1990s, many architecture proposals were being made for 64-bit microprocessors. The demand in the largest market segments was clearly not there; the 1990s would see enormous growth in sales of 32-bit systems. But in some market segments, clear demand for 64-bit systems would grow by the end of the decade, both for engineering workstations and for e-Commerce servers.

Engineers at HP and Intel independently looked at the problem of how to design 64-bit microprocessors that would bring along existing customers and, they hoped, pick up new ones, as they focused attention on their options for 64-bit systems. HP engineers looked at how to expand their 32-bit reduced instruction set computing (RISC) machines into a 64-bit machine. They updated their PA-RISC specification and proceeded to design and produce 64-bit RISC processors by the mid-1990s.

For Intel engineers who were tackling the problem of making microprocessors run the IA-32 complex instruction set computing (CISC) instruction set faster and faster, the answer was for the microprocessors to convert CISC instructions into micro-ops that resembled RISC instructions, and to build processor cores that executed these micro-ops. Thus, the high-level design concepts behind the cores of our processors did not differ radically from the design concepts behind other microprocessors (the particulars and the results might vary, but the high-level design concepts were the same). Maybe we should have taken the hint, and realized that RISC offered no advantage for the Itanium processor that we didn’t already have on the drawing board for IA-32 processors.

When engineers at Intel started with a “clean sheet of paper” in 1991, it was no surprise that two schools of thought emerged: one to extend the CISC IA-32 instructions to support 64 bits, and another to introduce a cleaner RISC-style instruction set with 64-bit support (while maintaining the old instructions for compatibility). While we settled on the RISC approach for a new 64-bit architecture, the attractive idea of extending IA-32 for 64-bit addressing was revisited several times within Intel. It turns out that, for all the marketing charm that a simplistic idea like adding 64-bit addressing to IA-32 holds, in reality that approach would not yield the long-term value we needed. We didn’t figure it out right away, but the RISC approach faced the same reality.

Intel embarked on a lot of studies, and eventually settled down to design a 64-bit processor using all the best ideas, building the new architecture from the ground up to execute a 64-bit RISC instruction set. It turns out that the simulator for this project, and the studies we did on it, saved Intel a lot of money in the long run. Our studies eventually convinced us that our RISC approach was a good one, but not a great one. Ultimately, our ideas for a new architecture offered nothing we couldn’t get out of either our own CISC or others’ RISC architectures with enough effort. It was time to face reality, or stick with a project that would launch a successful but non-revolutionary new architecture. Many at Intel felt we could do better.

Those who felt we could do better became convinced that the engineers at HP Labs, who felt the time for VLIW-style microprocessors had come, were right. The theoretical promise was that these would be superior to CISC-style or RISC-style microprocessor designs. The key is to take the base belief from VLIW, that an architecture should be designed to maximize compiler-to-hardware communication, and build upon it. It was a radical idea.

But others worried that 64 bits would be needed in the new millennium, and they saw time slipping away. The desire to barrel ahead with whatever we could bring to market most quickly was hard to argue against. In this context, would any company drop a 64-bit RISC-style design that was well underway in favor of a radical new engineering approach? It certainly meant several years of delay in launching a 64-bit processor, and embarking on a very complex and unproven design. But we knew that once we shipped one type of 64-bit machine, we would have to commit to it totally. By late 1994 the decision was made to cancel our RISC-style approach and invest in the much more radical approach: to define EPIC and launch the new effort we called Merced.
RISC vs. CISC

The term “load/store architecture” simply describes a machine whose instructions either load data from memory into registers, store data from registers to memory, or operate on data held in registers. No mixing. At first, microprocessors in general were not load/store machines. After a while, as computers got faster, the variety of instructions greatly complicated the life of computer designers. A revolt was in order, and the revolt labeled the old way of doing things as complex and therefore “bad.” The term Complex Instruction Set Computing (CISC) emerged and stuck.

CISC had deep roots in computer design, long before microprocessors. A single instruction might take a handful of clocks to execute, or thousands. The time spent on a single instruction depended upon the work it was doing, whether it was adding two numbers, computing a square root, or maybe even formatting a floppy disk. With CISC as your mindset, any of these were possible. Computer designers advanced the idea that, if we made each instruction simpler, we could concentrate on making computers faster. So we come to the first principle of making computers faster: make each instruction faster. The term Reduced Instruction Set Computing (RISC) was used as the label for this concept.

This is where RISC vs. CISC stopped being simple. In the early days, it was easy to promise that RISC would stay simple and good, and that CISC would stay complex and evil. Confusion from several sources made the question of which is better, RISC or CISC, much more complicated than it first seemed. The early and purest RISC designs chose to execute new instructions, which forced users to write new code. Designers promised faster computers both because the microprocessor designs would be faster and because compilers would be easier to write well. CISC was pronounced dead by many, and RISC promised a better life for all (especially the computer designers).
A lot of code was already written to use popular CISC instruction sets. The Intel® 80286 and the Intel386™ processors were very popular CISC processors by the time the RISC concepts gained widespread attention. The designers at Intel and other CISC microprocessor vendors recognized the obvious advantages of a load/store architecture. Intel actually worked on five microprocessors after the Intel386, representing the full breadth of choices:

The i960® processor was among the most popular of all RISC chips.
The i860 processor was a RISC/LIW hybrid.
A research project in conjunction with CMU, called iWarp, was a RISC/VLIW processor.
The i432 processor was an object-oriented processor, perhaps the most CISC microprocessor ever.
The most popular microprocessor, as determined by Intel’s customers, was the Intel® 486 processor.

The Intel 486 processor started Intel down the road of incorporating RISC-style simplifications into the core of the processor without changing the instruction set. This option of getting faster microprocessors without abandoning the programs you already had was a hit.[1] Two generations later, when Intel introduced the P6-microarchitecture, the battle was over. All the benefits of RISC were present in the core of the processor, but full compatibility with earlier Intel processors had not changed. Customers kept their code.

There was a price to pay: the microprocessor had to translate CISC instructions into RISC-like micro-ops that were executed inside the processor. This translation was done by a special part of the microprocessor, which made the microprocessor bigger and more complex to design. In the end, this was a small price to pay once in a design instead of having all code in the industry switch to a new instruction set.

RISC had other challenges, and CISC instruction sets had some advantages. For one thing, CISC code was smaller. RISC processors had a handicap in that they tended to have to fetch more instructions from memory, and that meant something else was needed to make up for the performance lost to the extra instruction fetching. Another problem was that RISC couldn’t stay pure. Floating-point operations (for non-integer math) were needed. Once these complex instructions are incorporated into RISC cores, those simple and good processors start to look more like CISC processors.
At the same time, the core of the CISC processors became like the core of RISC processors, leaving different instruction sets to use in programs as the distinction. CISC had a much bigger installed base, which was willing to pay to keep it running. RISC did not maintain a lead. By the time the Intel® Pentium® Pro launched in late 1995, Intel started what many had thought impossible: a back and forth game of who’s faster, the best RISC (DEC Alpha) or the best CISC (Intel Pentium Pro processor). My point here is not that Intel engineers are geniuses. The real point is that the difference between RISC and CISC instruction sets did not matter enough. This point did not sink in at Intel as quickly as you might think, but this idea eventually led to the conclusion that something radical was needed if you wanted more performance—and RISC over CISC was not the answer. Hint: EPIC is the answer!
Parallel Semantics Make EPIC Work

EPIC is about parallelism. EPIC means Explicitly Parallel Instruction Computing. Unlike earlier parallel (VLIW) designs, EPIC does not use a fixed-width encoding. Instead, anywhere from one instruction to as many as desired can be combined to operate in parallel. To make this variable width work, one simple rule is imposed that earlier parallel designs did not impose: programs must be written to work assuming sequential semantics. Consider a program to swap data:

TMP = A; A = B; B = TMP

In a machine with parallel semantics, a program can simply tell the machine to execute A=B and B=A in parallel. With EPIC, you may not do this, because sequential semantics would not swap the data (A=B and B=A in parallel means something different from doing A=B and then doing B=A). This simple restriction makes it possible to build machines with different levels of parallelism that execute the same programs and get the same results. An important objective of EPIC is to do exactly that.

The rest of the new features in EPIC are simply designed to allow a compiler to produce code that uses all this parallelism. In fact, so-called “compiler-to-hardware communication” is a principal objective of EPIC. There are three key concepts that we need to understand in order to appreciate the instruction set explanations in the remainder of this book:

Speculation
Predication (and Parallel Compares)
Rotating register files

[1] Designing wildly different computers over the years to run the same instructions was not a new concept. A very significant example was the IBM 360 series of computers, which ran the same programs for many years on computers with wildly different designs.
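Returning to the swap example in the discussion of parallel semantics, the difference between the two interpretations can be modeled in a few lines of Python. This is only an illustrative sketch: the helper names `run_sequential` and `run_parallel` and the dictionary used as a register file are assumptions made for the example, not anything defined by the architecture.

```python
def run_sequential(regs, copies):
    """Apply (dest, src) copies one after another; later copies see earlier writes."""
    regs = dict(regs)
    for dest, src in copies:
        regs[dest] = regs[src]
    return regs


def run_parallel(regs, copies):
    """Read every source first, then write every destination (parallel semantics)."""
    values = [regs[src] for _, src in copies]
    regs = dict(regs)
    for (dest, _), value in zip(copies, values):
        regs[dest] = value
    return regs


start = {"A": 1, "B": 2, "TMP": 0}
two_copy_swap = [("A", "B"), ("B", "A")]                      # A=B and B=A
three_copy_swap = [("TMP", "A"), ("A", "B"), ("B", "TMP")]    # TMP=A; A=B; B=TMP

parallel_result = run_parallel(start, two_copy_swap)       # swaps A and B
sequential_result = run_sequential(start, two_copy_swap)   # both end up holding B
safe_result = run_sequential(start, three_copy_swap)       # swaps under sequential rules
```

Only the three-copy version gives the intended swap under sequential semantics, which is the form a program has to take when it must produce the same result at any machine width.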
HOW EPIC WORKS

EPIC brings a lot of new things, but the biggest are explicit instruction-level parallelism, speculation, predication, and a large register file. Each of these makes a computer run faster. The faster you make a computer run, the harder it is to envision how to make it faster yet. At a very simplistic level, there are only two ways to make a computer go faster:

Principle 1: Make it do each thing faster.
Principle 2: Make it do more things at once.

Through these, a computer gets things done more quickly and we call that “faster.” As they say, the devil is in the details. Neither running fast nor juggling lots of things at once is simple.

A recurring theme is parallelism. In a way, this feels like cheating: the computer is not running faster; it is simply getting more done at once. But what do you do with all this parallelism? The traditional problem is that we never have enough work ready to keep a machine fully busy. The wider it gets, the less likely it is that we can use that power all the time.

A breakthrough happens when you decide to stop worrying about doing only the things we must. If we have all this power, the key is to not throw it away by letting it go unused. Anytime the processor is ready to do six more things, we don’t want to give it two things to do and ignore the ability to do four more. That would waste two-thirds of our processing power. Instead, we speculatively ask the machine to do more things. We pick things that might be needed in the future. We just aren’t sure at the time whether they will be needed. Let’s say that the extra things turn out to be needed only half the time. The other half of the time, the effort to do that work was simply wasted. We are still better off than if we had asked for nothing, because half of the time we have the work done by the time we are sure it needs to be done.
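The arithmetic behind this argument is simple enough to write down. The sketch below assumes the figures used in the text (six issue slots, two of them filled with required work, and speculative work that turns out to be needed half the time); the function name `slot_usage` and its parameters are hypothetical names chosen for illustration.

```python
def slot_usage(width, required, hit_rate):
    """Return (idle fraction without speculation,
               useful fraction without speculation,
               expected useful fraction with speculation)."""
    spare = width - required
    idle_without = spare / width
    useful_without = required / width
    # Spare slots are filled with speculative work; only hit_rate of it pays off.
    useful_with = (required + spare * hit_rate) / width
    return idle_without, useful_without, useful_with


idle, plain, speculated = slot_usage(width=6, required=2, hit_rate=0.5)
```

With these inputs, two-thirds of the slots would sit idle without speculation, and the expected share of slots doing useful work doubles from one-third to two-thirds when the spare slots are filled speculatively.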
In fact, this is the promise of EPIC. As machines get wider and wider, we just speculate more and more. The goal is to compute things before they are needed, so that when a program needs them, the answer is already there. Nothing is faster than something that is already done! In a traditional microprocessor, doing things in advance can get us into trouble. For instance, we might find that if we compute A+B too early, the values of A or B might change. Since we will have to compute A+B again before we use it, the original computation turns out to have been useless. Because of these problems, the architecture needs a lot of special features for speculation that make these exceptions efficient to detect and correct. Speculation is a big part of what EPIC offers, and speculation comes in a bewildering number of forms. But, all the speculation features in EPIC simply allow us to compute things before we are sure they are needed, and before we check to make sure it is not too early to do the work. Compilers support this most important concept for EPIC. The architecture is optimized to simplify the task of writing the very complex compilers needed for EPIC. The first thing EPIC offers to compilers is explicit access to the parallelism—hence the term “Explicitly” in EPIC. This access is a bonus for the compiler writer and the microprocessor designer. Both jobs are simplified by opening up the communication between compiled code and the hardware. Once we have explicit parallelism available, the challenge is using it. Speculation is the most important concept in EPIC for helping a compiler fill up the available parallelism. Speculation makes it possible to believe that we can write programs to use the parallelism available. More importantly, it is what makes compiler writers believe they can compile our programs into parallel code for the processor. Compilers that harness the instruction level parallelism are very complex. 
So speculation, predication, and a large register file simplify the compiler writers’ problems sufficiently for them to write good compilers to use EPIC. Predication is the second most important concept in EPIC for filling up the available parallelism. Practically every instruction that the processor executes is conditional, based on a predicate register. Each instruction either has an effect, or has no effect, based on the true/false condition in a predicate register. This predication of every instruction with no extra run-time cost allows very simple branch elimination, letting the compiler writer concentrate on finding parallelism. EPIC also offers very large register sets. Smaller register sets would limit performance because the processor has to shuffle data in and out of registers instead of doing work directly required for the program being run. The registers even offer an esoteric feature called rotating registers to reduce the need for register shuffling and code duplication when doing a type of loop level parallelism called software pipelining.
Finally, EPIC acknowledges that the parallelism a compiler can find is not consistent from clock to clock, and that machine width is not fixed for every microprocessor design. For example, the Itanium processor is six wide, but another processor might only be three wide. The Itanium processor cannot do every combination of six things at once, but it can do many of them. Future processors might be more flexible about what they can do in parallel, or how many things they can do in parallel, or both. EPIC allows code to express where the parallelism is, without forcing all code to be six wide forever, even though the Itanium processor is six wide. In early VLIW machines, changes in width from one generation of machines to the next caused substantial design problems, and code compiled for one machine did not work on another. EPIC solves this problem so that code can execute across all processor variations in the Itanium architecture.
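One way to picture width independence is a toy model in which the compiler marks groups of mutually independent instructions and each machine issues as many of them per clock as its width allows. The sketch below is a simplified assumption, not the actual Itanium encoding: the `run` function, the tuple format for instructions, and the register names are all invented for illustration.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def run(groups, regs, width):
    """Execute groups of independent instructions on a machine `width` slots wide.

    Returns the final registers and the number of clocks used.
    """
    regs = dict(regs)
    cycles = 0
    for group in groups:
        # A narrow machine needs several clocks to work through one group.
        for first in range(0, len(group), width):
            issued = group[first:first + width]
            # Instructions in a group never read each other's results,
            # so all sources can be read before any destination is written.
            results = [(dest, OPS[op](regs[a], regs[b])) for dest, op, a, b in issued]
            regs.update(results)
            cycles += 1
    return regs, cycles


program = [
    [("r4", "add", "r1", "r2"), ("r5", "mul", "r1", "r3"),
     ("r6", "add", "r2", "r3"), ("r7", "mul", "r2", "r3")],
    [("r8", "add", "r4", "r5"), ("r9", "mul", "r6", "r7")],
]
inputs = {"r1": 1, "r2": 2, "r3": 3}

wide_regs, wide_cycles = run(program, inputs, width=6)
narrow_regs, narrow_cycles = run(program, inputs, width=3)
```

Both runs leave identical values in every register; only the clock count differs, which is the property the text describes.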
Principle 1: Make It Do Each Thing Faster

In pursuit of making computers run single instructions faster (“Make it do one thing faster”), we’ve looked at pipelining, caches, load/store architecture designs and RISC vs. CISC. For this pursuit, we needed one last trick: branch prediction. One thing that computer programs do regularly is make a decision, and go to different pieces of code (instructions) depending on that decision. Branch instructions make us “go” somewhere else if the decision is to branch, or execute the next instruction (“fall through the branch”) if the decision is not to branch. Branching is generally bad because it messes up our nice model of reading one instruction while executing the previous instruction. If that last instruction was the branch, we find ourselves in a dilemma: we do not know where to get the next instruction in time to do the branching operation. We might do the first thing that comes to mind: wait until we know before fetching the next instruction. This wait attitude helps no one. It merely ensures that we get nowhere during the clocks that we wait.
Branch Elimination, or Predication

Once we know that branches cause problems, it is natural to want to eliminate them. Instead of branching around instructions that we do not want to execute, we simply predicate all instructions and provide FALSE predicates to turn off the instructions that we do not want to execute (those we would have branched around in other architectures). EPIC was designed for virtually every instruction to be predicated.

Earlier architectures introduced a limited number of predicated instructions. Just such a capability was added to many microprocessors, such as the Intel Pentium Pro, Pentium II, Pentium III, and Intel® Celeron™ processors. Such features turned out to be very hard to use, mostly because the processors did not offer enough parallelism for compilers and programmers to be confident that using the new instructions would not slow a program down more than branching, especially if the branch was easy to predict. The concept of predication makes a lot more sense in EPIC, where it is combined with the ability to execute a lot of instructions at once (parallelism). This combination is an excellent lead-in to our next topic: principle 2.
Principle 2: Make It Do More Things at Once

We haven’t exactly abstained from discussing this principle so far. For many recent computers, we fetched instructions while still executing previously fetched instructions, which is doing more things at once in a sense. However, we call this pipelining and not parallel execution. The reason is simple: each instruction is in a different part of its execution at any point in time. Each starts by itself; each ends by itself. If we start multiple instructions at the same time, we call it parallel execution. The net effect is similar: the computer is faster because it is getting more work done.

In a modern computer, loading data as early as possible is very useful for speeding up systems because, when a program loads data, it must wait if the data is not in cache and has to be retrieved from memory. This potential for delay leads to an obvious desire to do load instructions as early as possible, preferably so early that even waiting for memory is not a problem. Realistically, we can hope to eliminate at least some of the memory delays by moving a load earlier. Several things limit how early we can speculate a load from memory and have it provide better performance. In order to avoid these problems, EPIC specifically provides three key features:

Lots of registers (128 for integer values, 128 for floating-point values, and 64 for Boolean values, the “predicates”), which avoid problems with needing to hold more values in registers for longer, since we are loading earlier.

Speculative loads, which allow us to request memory loads before a branch that may or may not direct the program to where the load would normally be executed. The key feature of a speculative load is that it will not cause a program to abort if the request is invalid (which it very well may be, since the branch may prevent the program from executing the code from which the load came).

Speculation

Advanced loads, which allow us to request memory loads before a store that may or may not modify the memory from which the load read data. The key feature of an advanced load is the ability to detect later that the data loaded was overwritten, and therefore needs to be reloaded (and that any computations based on the wrong data need to be redone). An advanced load can be used speculatively (moved before a preceding branch) as well as being moved before a store.
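The two kinds of early load described above can be sketched as a small software model. Everything here is an assumption made for illustration: the `Memory` class, the `DEFERRED` token that stands in for a postponed fault, and the `watched` set that records outstanding advanced loads are hypothetical names, and the real hardware mechanisms are not described in this text.

```python
DEFERRED = object()  # hypothetical token: "this load failed, but nobody has needed it yet"


class Memory:
    def __init__(self, cells):
        self.cells = dict(cells)
        self.watched = set()  # addresses covered by outstanding advanced loads

    def speculative_load(self, addr):
        """Load before the guarding branch; a bad address yields a token, not an abort."""
        return self.cells.get(addr, DEFERRED)

    def advanced_load(self, addr):
        """Load before a possibly conflicting store and remember the address."""
        self.watched.add(addr)
        return self.cells[addr]

    def store(self, addr, value):
        self.cells[addr] = value
        self.watched.discard(addr)  # a store to a watched address invalidates the early load

    def check(self, addr, early_value):
        """Keep the early value if no store intervened, otherwise reload."""
        if addr in self.watched:
            return early_value
        return self.cells[addr]


mem = Memory({100: 7, 200: 40})

# Speculative load: hoisted above a branch that would have skipped it.
early = mem.speculative_load(999)   # invalid address, yet nothing is raised
pointer_is_valid = False            # branch outcome, known only later
result = early if pointer_is_valid else 0

# Advanced load with no conflicting store: the early value is kept.
a = mem.advanced_load(100)
mem.store(200, 41)
kept = mem.check(100, a)

# Advanced load followed by a conflicting store: the value is reloaded.
b = mem.advanced_load(200)
mem.store(200, 42)
reloaded = mem.check(200, b)
```

In this model the failed speculative load costs nothing because its result is never used, and the check after the store is what lets the second advanced load recover the correct value.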
Predication

Practically every instruction is written with a specified predicate register to control whether the instruction executes at run time. One predicate register is hardwired to always be true, so that instructions which should always be executed are actually coded with this register as their predicate control. In other words, instructions that appear not to be predicated in the assembly language are coded by the compiler to use the predicate register that is always true. Predication allows instructions to be executed in parallel, with some on and some off, but without the need for branches. The elimination of branches is beneficial, and the resulting predicated code can often be packed tightly with non-predicated code to more than make up for the fact that instructions with false predicates are using up available parallelism in the machine to do nothing.
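A small interpreter makes the idea concrete. The sketch below is a toy model under stated assumptions: the instruction tuples, the `cmp_gt` and `mov` operation names, and the `execute` function are invented for illustration, and `p0` is used here as the name of the predicate that is always true. The program computes the larger of two values with no branch at all.

```python
def execute(program, regs):
    """Run predicated instructions; an instruction with a false predicate has no effect."""
    regs = dict(regs)
    preds = {"p0": True}  # the always-true predicate
    for qp, op, *args in program:
        if not preds.get(qp, False):
            continue  # turned off: the slot is used but nothing changes
        if op == "cmp_gt":
            # One compare writes a predicate and its complement.
            p_true, p_false, a, b = args
            preds[p_true] = regs[a] > regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "mov":
            dest, src = args
            regs[dest] = regs[src]
    return regs


# m = a if a > b else b, written without a branch
max_program = [
    ("p0", "cmp_gt", "p1", "p2", "a", "b"),  # unconditional, so predicated on p0
    ("p1", "mov", "m", "a"),                 # takes effect only when a > b
    ("p2", "mov", "m", "b"),                 # takes effect only when a <= b
]

first = execute(max_program, {"a": 3, "b": 9, "m": 0})
second = execute(max_program, {"a": 9, "b": 3, "m": 0})
```

Both `mov` instructions occupy a slot on every run, but exactly one of them changes `m`, which is the trade the text describes: some parallelism is spent doing nothing in exchange for removing the branch.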
Parallel Compares

Predication of instructions by predicate registers means that we need to compute the true/false values to be placed in the predicate registers. A notable feature of the compare instructions is the ability to compute and combine comparison operations in parallel. Common expressions like ((A > B) and (B < 0)) can be computed in parallel using parallel compare instructions, instead of the obvious multiple step process of computing the results of the comparison operations and then combining the operations.
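One way to see why such compares can run side by side is to model each one as an instruction that only ever clears a shared predicate when its own test fails. Because every writer writes the same value, the order in which they run cannot matter. The sketch below assumes that model; `parallel_and` and `predicate_for` are hypothetical helper names, and the permutation loop exists only to demonstrate order independence.

```python
from itertools import permutations


def parallel_and(tests, order):
    """Combine compare results into one predicate, visiting them in the given order."""
    predicate = True  # set before the group of compares
    for index in order:
        if not tests[index]:
            predicate = False  # a failing compare clears; a passing one leaves it alone
    return predicate


def predicate_for(a, b):
    """Evaluate ((A > B) and (B < 0)) and confirm every ordering agrees."""
    tests = [a > b, b < 0]
    outcomes = {parallel_and(tests, order) for order in permutations(range(len(tests)))}
    assert len(outcomes) == 1  # no ordering of the compares changes the result
    return outcomes.pop()


both_true = predicate_for(5, -1)
second_false = predicate_for(5, 2)
first_false = predicate_for(-3, -1)
```

The two comparisons never need each other's results, so neither has to wait for the other before updating the predicate.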
Rotating Register File

EPIC provides several looping constructs. Combining these with the parallelism naturally leads one to do software pipelining in this architecture. Software pipelining is a technique that re-codes loops to execute code from multiple iterations in parallel with each other. Software pipelining normally leads to extra code in front of a loop (the prologue) and after the loop (the epilogue) to deal with priming the pipeline and draining the pipeline, respectively. In other architectures, this overhead has made software pipelining profitable only if the loop is known to run long enough to make up for it. That is often impractical for a compiler to determine, and it generally means that compilers only pipeline loops involving floating-point operations, since these tend to loop many times and the compiler can often compute the number of iterations because floating-point loops tend to have constant bounds. Software pipelining is seldom applied to integer loops, because the extra code can slow a program down through increased use of the instruction cache. And software pipelining is often not applied to loops with non-constant “trip counts” (the number of iterations a loop will take), because the prologue and epilogue code is more complex to create.

EPIC uses an elegant solution called rotating registers to allow a very compact representation of software-pipelined loops. With EPIC, a compiler can produce software-pipelined loops without having to generate prologue or epilogue code. The resulting code is simpler and takes up less space. This simplification makes it more likely that software pipelining will speed up code, so a compiler can be more aggressive about using this technique without the penalties you would expect with other architectures.
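The following Python model sketches how rotation lets one loop body serve as prologue, kernel, and epilogue. It is a simplified assumption rather than a description of the real hardware: the `Rotating` class, the three-stage load/compute/store split, and the `remaining` and `drain` counters are all hypothetical. After each rotation, the value written to logical register 0 is visible as logical register 1, so each stage finds its input where the previous stage left it one iteration earlier.

```python
class Rotating:
    """Register file whose logical names shift by one on every rotation."""

    def __init__(self, size, fill):
        self.cells = [fill] * size
        self.base = 0

    def __getitem__(self, i):
        return self.cells[(self.base + i) % len(self.cells)]

    def __setitem__(self, i, value):
        self.cells[(self.base + i) % len(self.cells)] = value

    def rotate(self):
        # What was logical register i becomes logical register i + 1.
        self.base = (self.base - 1) % len(self.cells)


def pipelined_double_plus_one(xs):
    """Compute 2*x + 1 for each x with a three-stage software pipeline."""
    stages = 3
    r = Rotating(stages, 0)        # loaded values
    s = Rotating(stages, 0)        # computed values
    p = Rotating(stages, False)    # one predicate per pipeline stage
    out = []
    next_load = 0
    remaining = len(xs)            # source iterations still to start
    drain = stages - 1             # extra passes needed to empty the pipeline
    while remaining > 0 or drain > 0:
        p[0] = remaining > 0       # a new iteration enters only while input remains
        if remaining > 0:
            remaining -= 1
        else:
            drain -= 1
        # The single loop body: each stage is switched on or off by its predicate.
        if p[0]:
            r[0] = xs[next_load]   # stage 1: load
            next_load += 1
        if p[1]:
            s[0] = r[1] * 2 + 1    # stage 2: compute on last pass's load
        if p[2]:
            out.append(s[1])       # stage 3: store last pass's result
        r.rotate()
        s.rotate()
        p.rotate()
    return out


pipelined = pipelined_double_plus_one([1, 2, 3])
```

During the first passes the later stages are simply predicated off, and during the last passes the load stage is, so no separate prologue or epilogue code appears anywhere in the function.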
ARCHITECTURES VS. MICRO-ARCHITECTURES
These days everyone seems to know that computers will be faster in a year than they are now. An observation known as Moore’s Law, that computers seem to double in speed every eighteen months, has become a way of life. When designing a computer, one selects an architecture, a micro-architecture, and an operating frequency target (MHz or GHz). Architectures seldom change. Micro-architectures change more often, and operating frequencies change most often of all. There are no rules for the length of time between any of these, but many people think of architecture changes as 10–25 years apart, micro-architecture changes as 3–6 years apart, and operating frequency changes as happening a few times a year. The IA-32 architecture includes the Intel486, Pentium processor, and P6-microarchitecture based processors, each with a different powerhouse, a different micro-architecture. Each micro-architecture offers the opportunity for many variations in speed. The Pentium processor was offered starting at 60 MHz, and eventually, as high as 200 MHz. That is an extraordinary range, and some significant changes in the micro-architecture, such as adding an extra pipeline stage, made this increased clock speed possible. But the general design held. Likewise, the P6-microarchitecture started at 150 MHz in 1995, and by early 2000, it sold in 1 GHz (1000 MHz) versions. Again, a few very significant revisions to the micro-architecture occurred which made this possible, but the core micro-architecture held steady. New architectures are more rare, and EPIC is one of these rare opportunities. In Intel’s main family of microprocessors, the architecture changes have come at the same time that we have worked to change the fundamental “width” of the processor. Speculation and predication are the keys to making explicit parallelism useful. 
Without these capabilities, the parallelism would go unused so often that a very wide machine would never be utilized enough to justify building it. These capabilities are further supported by predicate registers, parallel compares, and rotating registers. Readers of this book can understand these concepts early and understand the need for the many instructions documented in this book.
James Reinder Itanium Software Tools Technical Marketing Intel Corporation
Preface
Faster, cheaper, better. That’s what you would call the norm for computers, now and into the future. Moore’s Law tells us to expect computers to double in speed every 18 months. In the year 2000, Intel introduces its new Itanium™ processor, the first of a new family of 64-bit processors using the Itanium instruction set and based on revolutionary explicitly parallel instruction computing (EPIC). Development of the Itanium architecture signals the beginning of another era of computing: the evolution from 32-bit to 64-bit architecture for general-purpose computer systems and the trend toward increased use of parallelism. Itanium™ Architecture for Software Developers enables the expert and the novice alike to understand the Itanium architecture. This book strips away the mysterious jargon that computer architects and compiler engineers use to explain how we got here and where we are going. By explaining terms like pipelined architectures, caches, superscalar, out-of-order execution, dynamic execution, register renaming, RISC, and branch prediction, this book reveals how Intel has managed to adhere to Moore’s Law so far. Itanium architecture delivers an array of new capabilities, such as instruction level parallelism, software pipelining, speculation, predication, global instruction scheduling, and ALATs. These concepts show how future computer designs will keep the industry tracking according to Moore’s Law. More than any microprocessor development in recent memory, this architecture needs explanation. This book makes sure that all these terms are well explained and their intended uses are demonstrated. Samples of code written for an Itanium processor help the reader become familiar with these concepts and their implementation. Itanium™ Architecture for Software Developers is the first of a series of books for the software developer. 
Other books in the forthcoming series cover programming, how to write high-level-language programs that benefit from the resources of the Itanium processor, and performance tuning, how to measure and redesign application code to realize the Itanium processor’s higher performance. This series of books about Itanium architecture provides engineer-to-engineer communication from Intel Press.
Chapter 1: Introduction
OVERVIEW
The faster you make a computer run, the harder it is to envision how to make it run still faster. It is no longer sufficient to just make the architecture process instructions faster. You must make more things happen at once, that is, in parallel. The computer is not running faster; it is simply getting more done at once. The Itanium architecture employs a new parallel computer architecture called explicitly parallel instruction computing (EPIC). Intel Corporation’s first hardware implementation of this architecture is the Itanium processor. EPIC offers many new features to increase performance, but the most important are explicit instruction level parallelism (ILP), speculation, predication, and a large register file. Each of these makes a computer run faster. But what do you do with all this parallelism? The traditional problem is that we never have enough work ready to keep a machine fully busy. If we have all this power, the key is to not throw it away by letting it go unused. Anytime the processor is ready to do six more things, we don’t want to give it two things to do and waste two-thirds of the processing power. A breakthrough happens when we decide to stop worrying about doing only the things we must. The approach is to ask the machine speculatively to do four more things. That is, things are picked that might be needed in the future, but right now we just aren’t sure which of the four it will be. Let us say that the four extra things turn out to be needed only half the time. We are still better off than if we had asked the processor to do nothing. Half of the time the work is done by the time it is needed. In fact, this is the promise of explicitly parallel instruction computing. As machines get wider and wider, we just speculate more and more. The goal is to compute results before they are needed, so that when a program needs them, the answers are already there. Nothing is faster than something that is already done!
In a traditional microprocessor, doing things in advance can create trouble. For instance, if A+B is computed too early, the values of A or B might change and make the original result useless. The sum of A and B must be recomputed. Because of these problems, the architecture requires special features for speculation that make these exceptions efficient to detect and correct. Speculation is a big part of what EPIC offers, and it is offered in many forms. But, all the speculation features are simply in EPIC to allow us to compute things before they are needed, and before a check is made to ensure that it is not too early to do the work. EPIC builds on the abilities of a very long instruction word (VLIW) machine by allowing the selection of instructions to be executed in parallel to be done at compile time. For this reason, compilers may be the most important component of EPIC. Compilers to harness the instruction level parallelism are very complex. EPIC architecture is optimized for compiler writers, to simplify the task of writing the advanced compilers needed for EPIC. The goal of Itanium architectural features, such as speculation, predication, and a large register file, is to simplify the problems that a compiler writer faces so that it is possible to write good compilers to use EPIC. The first thing EPIC offers to compilers is explicit access to the parallelism—hence the term EPIC. This access is a bonus for the compiler writer and the microprocessor designer. Both jobs are simplified by opening up the communication between compiled code and the hardware. Once explicit parallelism is available, the challenge is to make full use of it. Speculative execution of instructions is the most important concept in EPIC for helping a compiler fill up the available parallelism. It is what makes it possible to believe that programs can be written to use the parallelism available. 
More importantly, speculative execution allows compiler writers to compile programs into parallel code for the processor. Predication is the second most important concept in EPIC for filling up the available parallelism. Practically every instruction that is executed is conditional, based on a predicate register. Each instruction either has an effect, or has no effect, based on the True or False condition in a predicate register. Branches in program control impede full use of parallelism. This predication of every instruction, with no extra run-time cost, allows the elimination of a branch, thereby increasing parallelism. EPIC also offers very large register sets. A smaller register set limits performance. The processor has to shuffle data in and out of registers instead of doing the work required for the program. The registers in the Itanium architecture even offer a feature called rotating registers, which reduces the need for register shuffling and code duplication when doing a type of loop level parallelism called software pipelining. Finally, EPIC acknowledges that the parallelism, which a compiler can find, is not consistent from clock to clock, and machine width is not fixed for every microprocessor design. The Itanium processor is six execution units wide, but cannot do every combination of six things at once. Therefore, EPIC allows code to express where the
parallelism is, without forcing all code to be six wide forever. Future Itanium processors may be more flexible about what they can do in parallel or how many things can be done in parallel, or both. In earlier VLIW machines, inflexibility was a substantial problem in their designs. Width changed from generation to generation of the machine and the code compiled for one machine did not work on another. That is, they were not scalable. EPIC solves the scalability problem and code can execute across all Itanium-based computers.
CONCEPTS LEADING TO EPIC The two ways of making a computer go faster are: make it do each thing faster and make it do more things at once. Through these methods, a computer does more things in a given time and is recognized as being “faster.” Neither running fast, nor juggling lots of things at once is simple. A number of architectural innovations have driven the evolution from CISC microprocessors to EPIC-based processors and the resulting improvements in performance. This section explains these architectural capabilities.
Pipelining
Consider a simple computer that executes one instruction, then another. Even this simple computer is complicated. The computer starts by reading the instruction from somewhere, interpreting it, and doing the required action. Often that action requires information (data) to be obtained (read) from somewhere, acted upon (perhaps added together) and the result to be written out. The computer ends up doing five things to carry out (execute) each instruction. They are:
    READ INSTRUCTION
    READ FIRST INPUT DATA
    READ SECOND INPUT DATA
    ACT ON DATA (ADD INPUTS TOGETHER)
    WRITE ANSWER OUT
If the computer can read the next instruction at the same time as one of the other steps in the instruction execution process is taking place, instructions execute a little faster. This read cannot be done during a step that involves another read or write. For this reason, the read of the next instruction is usually performed overlapping with the “act on data” step. This architectural innovation is known as pipelining.
Load/Store Architecture
The simple computer used to demonstrate pipelining is in fact more complicated than it needs to be. If the rules are changed to eliminate the data read steps, the computer can be simpler. The catch here is that an instruction is not what it used to be. In our original design, an instruction could get its data from anywhere, specifically, the memory subsystem. An instruction is no longer able to read values from two memory locations, ADD them, and write the sum to memory. Now, the data must be held in registers within the processor. The architecture of the computer must be changed by adding instructions to read information in memory into registers and to write information in registers into memory. These additions are known as the load and store instructions, respectively. Now separate instructions are available to load from memory to a register, add data in registers, or store the result in a register to memory, but only one operation can be done at a time. The computer is now a load/store machine. Data accesses are made with load and store instructions, no longer as part of the execution of another instruction. That is, data can only be loaded from or stored to memory explicitly. Load/store machines are easy to implement and generally run faster. Each instruction is easier to implement. The catch is that it now takes more instructions to do the same work. For instance, the single complex instruction from our earlier example, which adds the values in two memory locations and stores the result in memory, is implemented with four instructions:
    LOAD INPUT 1
    LOAD INPUT 2
    ADD
    STORE RESULT
The use of registers turns out to be efficient at reducing the number of memory loads and stores. If inputs already exist in registers or results can remain in registers, the need for memory accesses is eliminated. For instance, if the add operation in our example is actually an accumulate operation, input 2 and the result are located in the same register and the result is created by the add operation. Now the program sequence reduces to two instructions:
    LOAD INPUT 1
    ADD INPUT 1 TO INPUT 2
How does our new computer perform the operation of our earlier complex instruction example? Let us assume that it takes the load/store architecture two instructions to do the same work. A first estimate is that the computer needs to run twice as fast just to keep up with the old way of programming. However, our new computer has three steps instead of five for each instruction:
    INSTRUCTION LOAD
    ACT ON DATA FROM REGISTERS [the act may be a LOAD or a STORE]
    PUT RESULT IN REGISTER
As shown in Figure 1.1, instruction 1 is performed in three clocks.
Figure 1.1: Pipelined Execution in a Load/Store Architecture
The interesting twist now is that the “instruction load” step for the next instruction can be performed at the same time as the “put result in register” step of the prior instruction. This pipelining can be accomplished because the computer is not busy talking to memory. In fact, if the “act on data” step is not a load or store that uses memory, the “instruction load” can be executed during this step. Figure 1.1 shows this overlapping of the execution of instructions 1 through 3. Now the three instructions are executed in five clocks. With these improvements, performance approaches one instruction per clock. Notice that during clock 3 all three steps of instruction execution are performed. The result of instruction 1 is saved in a register, the data of instruction 2 is acted upon, and instruction 3 is loaded. We could say that this computer is doing one instruction per clock. However, if the “act on data from register” operation is a load or a store, the next “instruction load” has to be delayed one clock.
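The overlap described for Figure 1.1 can be tabulated with a few lines of Python. This is a minimal sketch of the ideal case only: it assumes that no “act on data” step is a load or store, so no instruction load is ever delayed, and the function names are made up for this example.

```python
STAGES = ("instruction load", "act on data", "put result in register")


def pipeline_schedule(num_instructions):
    """Return {clock: [(instruction, stage), ...]} for an ideal pipeline.

    Instruction k starts in clock k, one clock after its predecessor,
    and spends exactly one clock in each of the three stages.
    """
    schedule = {}
    for instr in range(1, num_instructions + 1):
        for offset, stage in enumerate(STAGES):
            clock = instr + offset
            schedule.setdefault(clock, []).append((instr, stage))
    return schedule


def total_clocks(num_instructions):
    """Clocks needed to finish all instructions (0 for an empty program)."""
    schedule = pipeline_schedule(num_instructions)
    return max(schedule) if schedule else 0
```

In this model `total_clocks(3)` is 5, and the entry for clock 3 lists all three stages at once, one for each of instructions 1, 2, and 3. For a long run of N instructions the total is N + 2 clocks, which is why performance approaches one instruction per clock.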
Cache Occurrences of load and store operations interfere with the pipelined execution of instructions. Like other problems, this one has a solution. Another enhancement to the architecture called caches is the solution. A cache is simply a small, fast memory that is located within a processor. Adding a few extra circuits, the processor uses the caches automatically without requiring the programmer to do anything other than read and write memory just as one would do without caches. The cache memory allows many of the reads and writes for our example computer to happen at full speed by keeping copies of the value from parts of main memory in the
cache. When the processor reads or writes and the data exists in the cache, a cache hit has occurred and the data is accessed from the cache, not main memory. This memory access is completed in one clock. Delays still happen when the cache does not have a copy of the data. This condition is called a cache miss. In this case, the data must be read from the main memory. Delays also happen if the cache is full of written data, and needs to make room for newly requested data. In this way, we see that the existence of a cache alone does not assure that all memory accesses occur in one clock. Caches actually have the ability to solve several problems. Today’s microprocessors are faster than available memory chips. For this reason, multiple clocks are required to access information located in main memory. These extra clocks reduce the performance of the computer. For our simple computer to keep running faster, loads and stores must run in one clock a high percentage of the time. Single-clock memory accesses are achieved for all data held in cache. Whether you have a load/store architecture or not, lots of instructions use memory for data reads and writes. These loads and stores of data may interfere with the loading of instruction code in the “read instruction” step. Again, the impact is a decrease in pipelined execution. If we can avoid this conflict, our computer can truly feel like it is a one instruction per clock machine. Two standard solutions have evolved for eliminating the conflict between instruction reads and data loads and stores. One approach is to have two memory systems, one for instructions and another for data, and connect them to the processor in a way that they do not interfere. This configuration is called Harvard architecture. A more common solution is to use cache memory. The computer has just one main memory, but two caches: one for instructions and one for data. 
The code cache and the data cache are placed inside the processor in a way that they do not interfere. This architecture is known as a von Neumann architecture. This latter approach is preferred mostly because it requires only one set of connections to the main memory outside the processor. The reduced number of external connections is important because there is always a limit on how many pins you can have on the processor. Many instructions operate on two inputs and therefore require data to be loaded from two places. This last problem can also be resolved through the use of cache. The cache can be designed to retrieve two pieces of data at once. This organization is called a “multiported” data cache design. To maximize performance, we will assume our simple computer has the benefit of a cache that eliminates multiple clocks for memory accesses, removes conflicts between instruction reads and data loads and stores, and provides multi-port access to memory. The performance achieved with this redesign is much better than that of our original complex instruction computer design. Cache design can get much more complicated in order to be faster and faster, or to be able to serve multiple data in one clock. In this
sense, caches, like whole computers, can be designed to do things faster and to do more things at once.
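The hit and miss behavior described in this section can be illustrated with a toy read-only cache. Everything specific here is an assumption made for the sketch: the class name, the four-line direct-mapped organization, and the 10-clock miss cost are invented, and real caches are considerably more elaborate.

```python
class TinyCache:
    """Toy direct-mapped, read-only cache that counts clocks."""

    def __init__(self, memory, num_lines=4, miss_clocks=10):
        self.memory = memory            # main memory: a list of values
        self.num_lines = num_lines
        self.miss_clocks = miss_clocks  # assumed cost of going to main memory
        self.lines = {}                 # slot index -> (address, value)
        self.clocks = 0
        self.hits = 0
        self.misses = 0

    def read(self, address):
        slot = address % self.num_lines
        entry = self.lines.get(slot)
        if entry is not None and entry[0] == address:
            self.hits += 1
            self.clocks += 1            # a hit completes in one clock
            return entry[1]
        # Miss: fetch from main memory and keep a copy, replacing
        # whatever previously occupied this slot.
        self.misses += 1
        self.clocks += self.miss_clocks
        value = self.memory[address]
        self.lines[slot] = (address, value)
        return value
```

Reading addresses 0, 1, 0, 1 costs two misses and then two hits. Reading address 4 afterward evicts address 0 from its slot, so a later read of address 0 misses again, which shows why a cache alone does not guarantee one-clock access.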
Branch Prediction One thing that computer programs do regularly is make a decision, and go to different pieces of code (instructions) depending on that decision. A branch instruction accomplishes this change in program control. If the result of the decision is to branch, we go to somewhere else to continue or if the decision is not to branch, we fall through the branch and execute the next instruction. Branching is generally bad, because it interferes with our pipelined model of reading the next instruction while executing the previous instruction. If the previous instruction was the branch, we find ourselves in a dilemma. We do not know where to get the next instruction in time to fetch it while executing the prior instruction. One approach would be to wait until we know, before fetching the next instruction. This approach of waiting ensures that we get nowhere during the clocks that we wait. Fully pipelined execution is interrupted. Two solutions emerged in processor designs. One approach is to guess where the next instruction is going to come from, and to fetch the next instruction from there. If we are correct, the pipeline remains full and parallel execution is maintained. If we are wrong, we need to figure that out and make things right later. The time that is lost recovering is known as a branch-mispredict delay. Predicting the place to get the next instruction is a form of simple speculation, which is a very important concept to the Itanium architecture. The other solution, which is more ingenious and worked better for a while, is called a delay slot. The idea is simple. You move the instruction immediately before the branch instruction to the line after the branch instruction. That is, you place it in the delay slot. Now a branch always must execute the instruction after the branch and then make the branch. In this way, the pipelining and parallel execution is maintained while the decision of the branch is made. 
This clever idea worked very well as long as you could figure out where the branch was going in the time the processor took to execute that single instruction in the delay slot, and as long as you could fill the delay slot with something useful. For a while, this worked. But as time went on, more than one clock of delay was needed to be sure of the branch’s destination and even if you added more delay slots, you had a harder and harder time filling them with useful work. So, the delay slots ended up just being delays, and the branch prediction scheme looked better and better. Eventually, all microprocessors migrated to branch prediction and the techniques to predict well have become more and more complex. The inability to predict the future (a branch) perfectly, every time, means that eliminating branches altogether seems like a great idea.
Branch Elimination: Predication
Once we know that branches cause problems, it is natural to want to eliminate them. The process by which branches are eliminated is known as predication. Consider a simple piece of pseudo-code:
    if (a > b) {
        x = a
        z = 1
    } else {
        x = b
        z = z + 1
    }
If the processor has the ability to predicate an action, you can rewrite your program to avoid branching altogether:
    test = TRUE if (a > b), else FALSE
    if (test is TRUE) tmp1 = a
    if (test is FALSE) tmp1 = b
    x = tmp1
    if (test is TRUE) tmp2 = 1
    if (test is FALSE) tmp2 = z + 1
    z = tmp2
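To check that the branch-free rewrite computes the same thing as the original, the two forms can be modeled in Python. This is only an illustration: `predicated_move` is a hypothetical helper standing in for one predicated instruction, and Python itself still branches internally when it evaluates the conditional expression.

```python
def with_branch(a, b, z):
    """The original pseudo-code, using an ordinary if/else."""
    if a > b:
        x = a
        z = 1
    else:
        x = b
        z = z + 1
    return x, z


def predicated_move(predicate, new_value, old_value):
    """Model one predicated instruction: it always issues, but the
    destination only changes when the predicate is true."""
    return new_value if predicate else old_value


def without_branch(a, b, z):
    """The rewritten form: every statement issues, none jumps."""
    test = a > b
    x = None
    x = predicated_move(test, a, x)
    x = predicated_move(not test, b, x)
    new_z = z
    new_z = predicated_move(test, 1, new_z)
    new_z = predicated_move(not test, z + 1, new_z)
    return x, new_z
```

Both functions return the same `(x, z)` pair for any inputs; the second simply executes all four assignments every time and lets the predicate decide which two take effect.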
This predication capability is added to many microprocessors today, including the Intel Pentium III processors. This feature turns out to be very hard to use, since the processor now executes a lot more instructions than were required for the original code with the branch. To pick the best code, you need to compute how often the branch is mis-predicted (has a penalty) versus correctly predicted (no penalty). The new code is clearly slower if the branches would have been predicted correctly, and might be faster only when they would have mispredicted. The concept of predication makes a lot more sense in EPIC, where it is combined with the ability to execute a lot of instructions at once (parallelism). In this case, the limited number of instructions inside the then branch (x=a and z=1) and the else branch (x=b and z=z+1) limit our performance. For code with a branch, a machine that can do six things at once can finish in 3 clocks plus any delays due to mis-predicted branches:
    clock 1         test = TRUE if (a > b), else FALSE
    clock 2         if (!test) goto LABEL1
    clock 3         x = a
        and         z = 1
        and         goto LABEL2
                LABEL1:
    clock 3         x = b
        and         z = z + 1
                LABEL2:
Without a branch, the machine can do the operation in 2 clocks, with no branch delays:
    clock 1         test = TRUE if (a > b), else FALSE
    clock 2         x = a only if test is TRUE
        and         x = b only if test is FALSE
        and         z = 1 only if test is TRUE
        and         z = z + 1 only if test is FALSE
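The trade-off between the two listings can be put into numbers with a small cost model. This is a sketch under simplifying assumptions: each instruction takes one clock, instructions are grouped into dependency levels that must run one after another, and the misprediction penalty and the 50/50 split between the two paths are made-up parameters.

```python
from math import ceil

# Instructions per dependency level, taken from the listings above.
BRANCH_THEN = [1, 1, 3]  # compare; branch; x = a, z = 1, goto LABEL2
BRANCH_ELSE = [1, 1, 2]  # compare; branch; x = b, z = z + 1
PREDICATED = [1, 4]      # compare; four predicated assignments


def clocks(levels, width):
    """Clocks to run the dependency levels on a machine `width` wide."""
    return sum(ceil(count / width) for count in levels)


def expected_branch_clocks(width, mispredict_rate, penalty, then_fraction=0.5):
    """Average clocks for the branching code, including the assumed
    penalty paid on each mispredicted branch."""
    base = (then_fraction * clocks(BRANCH_THEN, width)
            + (1 - then_fraction) * clocks(BRANCH_ELSE, width))
    return base + mispredict_rate * penalty
```

On a machine six wide, the model gives 3 clocks for the branching code before any penalty and 2 for the predicated code, matching the listings. On a machine one wide, the predicated code takes 5 clocks against an average of 4.5 for a perfectly predicted branch, which is the situation the text describes for processors without much parallelism.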
Superscalar
The predication example is an excellent lead-in for the next topic: parallelism, that is, the ability to do more things at once. In our simple computer, fetching instructions while still executing previously fetched instructions is parallelism, in the sense of doing several things at once. However, we called this pipelining and not parallel execution. The reason is simple. With pipelining, each instruction is in a different part of its execution at any point in time. Each starts by itself and each ends by itself. If we start multiple instructions at the same time, we call it parallel execution. The net effect is similar; the computer is faster because it is getting more work done. The simplest form of parallel execution is called superscalar. Scalar is a reference to doing one thing at a time, and superscalar means doing more than one thing at a time. The Pentium processor from Intel is a superscalar processor. Instead of starting one instruction at a time, the processor starts the next two instructions at the same time whenever possible. This is not always possible. A common problem is that one instruction computes an answer that the next instruction uses as its input. Thus, the second instruction must wait for the first to be done before it can start. Naturally, such instructions cannot be done in parallel. A simple superscalar design like the Pentium processor is forced to wait for the first instruction to complete before doing the next instruction. After one instruction is done, the processor checks whether the next two can be run in parallel. This way, every clock the processor tries to start two instructions but falls back on starting just one when the two under consideration have some dependency on each other.
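The pairing rule just described can be sketched as follows. The model is deliberately simple and is not a description of the actual Pentium pairing rules: it checks only the one dependency named in the text (the second instruction reading the first one's result), and the `Instr` type and register names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                 # register written by the instruction
    sources: Tuple[str, ...]  # registers read by the instruction


def can_pair(first, second):
    """Two adjacent instructions may start together only if the
    second does not use the first one's result as an input."""
    return first.dest not in second.sources


def issue_clocks(program):
    """Clocks needed when at most two adjacent instructions start per clock."""
    clock_count = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2  # both instructions start in this clock
        else:
            i += 1  # fall back to starting just one
        clock_count += 1
    return clock_count
```

A dependent pair such as `r1 = r2 + r3` followed by `r4 = r1 + r5` takes two clocks in this model, while two independent instructions take one.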
Dynamic Execution A variety of techniques can improve a simple superscalar design. The objective is to find a way to always execute instructions in parallel, even if the next two instructions have a dependency.
Intel microprocessors based on the P6-microarchitecture employ a collection of these techniques called dynamic execution. The Pentium Pro, Celeron, Pentium II, and Pentium III processors are all based on the P6-microarchitecture. To solve this problem, the microprocessor looks at more than the next two instructions when deciding which instructions to execute in parallel. The hardware solution is a complex, multi-stage design that effectively looks ahead at a window of instructions, about the next twenty, and selects up to three that can be executed in parallel. The three instructions selected generally are not three in a row, so the processor has to do a lot of work to keep track of the order and get the proper results. The effect is dramatic. Suddenly, the superscalar design that struggled to execute two instructions in parallel, if they happened to occur in the proper sequence, looks ahead to find instructions which can run in parallel. This technique is successful enough to often find three instructions to do in parallel. The problems of dynamic execution are more complex than just looking ahead and grabbing instructions to run. Techniques like register renaming have to be used to break what we call false dependencies, which have to exist due to the small register set in the IA-32 processors. There are other complications as well; each of which had to be satisfied in order to make dynamic execution a reality. EPIC avoids the need for such complex hardware. Parallelism is explicitly specified in the instructions instead of being something for which the processor must hunt to run programs fast. Dynamic execution became a necessity when the time came to run three instructions at once using an instruction set that started with microprocessors that ran one instruction at a time. The Itanium architecture starts with an implementation that can run six instructions at once, without employing superscalar detection hardware and dynamic execution.
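The look-ahead idea can be approximated with a short scheduling model. It is a sketch only and does not describe the real P6 hardware: it assumes every result is available one clock after its instruction starts, that register renaming has already removed all false dependencies, and that a program is just a list of hypothetical `(dest, sources)` tuples.

```python
def dynamic_issue_clocks(program, window=20, width=3):
    """Clocks needed when up to `width` ready instructions are picked
    each clock from the next `window` instructions not yet started.

    program is a list of (dest, sources) tuples in program order.
    """
    pending = list(range(len(program)))  # indices not yet started
    clock_count = 0
    while pending:
        issued = []
        for idx in pending[:window]:
            _, sources = program[idx]
            # Blocked if any earlier instruction that has not finished
            # before this clock produces one of this instruction's inputs.
            blocked = any(program[j][0] in sources
                          for j in pending if j < idx)
            if not blocked:
                issued.append(idx)
                if len(issued) == width:
                    break
        # The oldest pending instruction is never blocked, so at least
        # one instruction starts every clock and the loop terminates.
        pending = [idx for idx in pending if idx not in issued]
        clock_count += 1
    return clock_count
```

Given five instructions where the second depends on the first and the fifth on the third, the model starts instructions 1, 3, and 4 in the first clock and the remaining two in the second, even though the instructions chosen are not three in a row.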
INTEL® ITANIUM™ ARCHITECTURE: NEW FEATURES
EPIC is about parallelism. Unlike earlier parallel (VLIW) designs, Itanium architecture does not use a fixed-width instruction encoding. Instead, instructions can be combined to operate in parallel from one to as many instructions as desired. To make this variable width work, one simple rule was imposed that earlier parallel designs did not impose. Programs must be written to work assuming sequential semantics. Consider a program to swap data:
    TMP = A; A = B; B = TMP
In a computer with parallel semantics, a program can simply tell the machine to execute A=B and B=A in parallel. With the Itanium architecture, you may not do this because sequential semantics would not swap the data. That is, A=B and B=A in parallel means something different than doing A=B then B=A. This simple restriction makes it possible to build machines with different levels of parallelism, to execute the same programs and get the same results. This scalability is another important characteristic of the Itanium architecture. The rest of the new features in the Itanium architecture are designed simply to allow a compiler to produce code to use all this parallelism. The three key concepts to understand are:
    Speculation
    Predication (and Parallel Compares)
    Rotating register file
These capabilities make explicit parallelism useful. Without them, the parallelism would go unused so often that a very wide machine would never be busy enough to justify building it. Reviewing these concepts early should enable you to appreciate the need for the many instructions documented in the remainder of the book.
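The difference between the two semantics in the swap example is easy to demonstrate in Python, where tuple assignment happens to read both right-hand sides before writing either target. The function names are invented for this illustration.

```python
def run_sequential(a, b):
    """A = B, then B = A: the second statement sees the new A."""
    a = b
    b = a
    return a, b


def run_parallel(a, b):
    """A = B and B = A 'at the same time': both right-hand sides
    are read before either variable is written."""
    a, b = b, a
    return a, b


def swap_with_temp(a, b):
    """The form that swaps correctly under sequential semantics."""
    tmp = a
    a = b
    b = tmp
    return a, b
```

Starting from `(1, 2)`, the sequential version yields `(2, 2)`, while the parallel version and the version with a temporary both yield `(2, 1)`. Code written for sequential semantics therefore gives the same answer no matter how many instructions a particular machine chooses to run at once.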
Speculation

In a modern computer, loading data as early as possible is very useful for speeding up systems. If the data is not in cache when loaded by the program, execution must wait while the data is retrieved from memory. This delay is due to memory latency. This potential for delay leads to an obvious desire to do load instructions as early as possible, preferably so early that even waiting for memory is not a problem. Realistically, the compiler can try to eliminate at least some of the memory latency delays by moving LOADs earlier.

Several things limit how early one can speculate a load from memory and have it be useful:

Data is loaded so early that the register holding the data has to be used for something else before the program gets around to using the data that was loaded.
Data is loaded before it is certain that the memory access would actually happen in the program (control speculation).
Data is loaded before the necessary data has been computed and stored in memory (data speculation).

In most machines, these limitations severely impede our ability to do loads early enough to get dramatic speed-ups. With Itanium architecture, each of these problems is addressed. Many more details are given in the chapters of this book, but here is a summary of the problems and their solutions.

Problem #1: Data is loaded so early that the register holding the data has to be used for something else before the program gets around to using the data we loaded.

Solution: The architecture provides lots of registers: 128 for integer data, 128 for floating-point data, and 64 for predicates (True/False).

Problem #2: Data is loaded before it is certain that the memory access would actually happen in the program. The problem here is that the load could cause a fault (possible program termination) in cases where a branch in program execution would have resulted in the LOAD never taking place.

Solution: Instead of using a normal load operation, the processor executes a speculative load (ld.s) instruction earlier in the instruction stream. Each register has an associated bit, known as the Not a Thing (NaT) bit (NaTVal for floating-point registers), that keeps track of whether or not the data is valid. Later, when the load is known to be necessary, a speculation check (chk.s) instruction checks the NaT bit to confirm that the data is valid and, if invalid data is detected, initiates a recovery.

Problem #3: Data is loaded before the necessary data has been computed and stored in memory.

Solution: Instead of using a normal load, the processor executes an advanced load (ld.a) instruction earlier in the instruction stream. An entry in a special hardware structure called the advanced load address table (ALAT) marks the occurrence of an advanced load. Later, when the load is proven to be necessary, a check load (ld.c) instruction checks the entry in the ALAT to confirm that the data is valid and to initiate a recovery if it is not.

Problem #4: Data is loaded both before it is certain that the memory is accessible and before it is certain that no stores can overwrite the memory.

Solution: Use a speculative advanced load (ld.sa) instruction followed by an advanced load check (chk.a) instruction instead of a check load (ld.c). This solution is simply a combination of the concepts mentioned in the last two problems.
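The control-speculation idea in Problem #2 can be modeled with a small Python sketch. It is an assumption-laden toy: memory is a dictionary, a "fault" is simply a missing address, and the names Register, ld_s, chk_s, and ld are hypothetical stand-ins for the behavior of the instructions, not their real encodings or semantics in full.

```python
from dataclasses import dataclass


@dataclass
class Register:
    value: int = 0
    nat: bool = False  # Not a Thing: set when a speculative load deferred a fault


def ld(memory, address):
    """Normal load: an inaccessible address faults immediately."""
    if address not in memory:
        raise MemoryError(f"fault at {address:#x}")
    return Register(memory[address])


def ld_s(memory, address):
    """Speculative load: a fault is deferred by setting NaT instead of raising."""
    if address not in memory:
        return Register(0, nat=True)
    return Register(memory[address])


def chk_s(register, recovery):
    """Speculation check: run the recovery routine only if NaT is set."""
    return recovery() if register.nat else register
```

The point of the design is that ld_s can be hoisted above a branch safely: if the branch would have skipped the load, the NaT result is simply never checked and no fault is ever raised.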
Speculating More Than Loads

Once the processor is loading data early, it is only natural to start doing computations on the data early too. Doing computations early simply means using the data from an advanced or speculative load before it can be certain that the data was really loaded, or that it was valid data. By loading data early, and starting a chain of dependent computations (ones that use the data as input), the program can get work done earlier by using up the parallelism available in the processor. The consistent theme is using the parallelism available in Itanium architecture to get work done even before the results are needed, or before all the inputs (the loaded data) are ready. The trick is to fill up the available parallelism with those loads and computations that are most likely to be needed. In this way, the processor can make the most of the extra parallelism that Itanium architecture has to offer.

The problem with doing the computations early, based on loads that may not have gotten the right data, is what to do if we eventually determine that the loaded data is not valid. In the case of an advanced load, this mistake could mean that the wrong data was loaded because a store has overwritten that memory location and changed the data. For this case, the advanced load check identifies data that is invalid and initiates a branch to recovery code. Since more happened to the data than just loading it, the fix-up code needs to do more than just reload the data. It needs to redo the computations with the correct data. Recovery code is something we want to use only occasionally. In the normal case, we should get the benefits from speculating correctly and from getting the resulting load and computations done early.
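The recovery requirement described above (reload and redo the dependent computation) can be sketched as follows. This is a deliberately simplified model: the ALAT is a plain set of addresses, the dependent computation x * 2 + 1 is arbitrary, and the function name is hypothetical. The real ALAT is organized differently; only the invalidate-on-store idea is illustrated.

```python
def run_with_advanced_load(memory, load_addr, store_addr, store_value):
    """Hoist a load and its dependent computation above a possibly aliasing store."""
    alat = set()

    # Advanced load (in the spirit of ld.a): record the address, read early.
    alat.add(load_addr)
    x = memory[load_addr]
    y = x * 2 + 1  # dependent computation, also done early

    # The store that originally preceded the load. A store to a tracked
    # address removes the matching entry.
    memory[store_addr] = store_value
    alat.discard(store_addr)

    # Advanced load check (in the spirit of chk.a): entry gone means stale data.
    if load_addr not in alat:
        x = memory[load_addr]  # recovery: reload ...
        y = x * 2 + 1          # ... and redo the dependent computation
    return y
```

When the store goes to a different address the early result is kept unchanged; when it hits the loaded address, the recovery path recomputes the result from the stored value.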
Predication

A predicate register controls practically every instruction defined by the Itanium architecture. Most instructions are written to make no use of predication. Instead, they are controlled by a predicate register that is always True. Therefore, instructions that appear not to be predicated in the assembly language are actually coded to use this always-True predicate register.

The use and benefits of predication were covered earlier. In summary, predication allows instructions to be executed in parallel, with some instructions turned on and some instructions turned off, but without the need for branches. The elimination of branches is beneficial, and the compiler often can pack the resulting predicated code with nonpredicated code. Greater compactness more than makes up for the fact that instructions with false predicates are using up available parallelism in the machine to do nothing.
Parallel Compares

Predication of instructions by predicate registers means that the processor needs to compute the True/False values to be placed in the predicate registers. Three potential problems are identified when instructions start using predication:

Problem #1: It is useful to have the complement of a predicate.

Solution: The compare instructions can write the result to one predicate register and the complement to another. This separation allows a test for A>B to place the value True in one predicate register, where it is used to predicate the code to be executed when A is greater than B, and to place the value False in another register to predicate the code to be executed when A is not greater than B.

Problem #2: If your program is comparing data, at least one of which originated from an invalid speculative load, it would be useful to not execute any code based on either output predicate because the comparison is useless.

Solution: A compare instruction sets both of its output predicate registers to False if either of its inputs is an invalid result of a speculative load.

Problem #3: Compare instructions can only compare two numbers at a time, which implies that the value of the expression ((A>B) OR (B>C) OR (C>D)) must be computed as follows:

1. Compute (A>B), (B>C), and (C>D) in one clock.
2. Combine two of these results in another clock.
3. Combine that result with the third comparison in a third clock.

Computing a predicate for this expression takes three clocks after the values of A, B, C, and D are known, and the action takes five instructions.

Solution: The architecture provides special parallel compare instructions that are an exception to the rule that multiple instructions cannot write the same register at the same time (in parallel). The trick is that these instructions come in two versions. One writes True or does nothing; the second writes False or does nothing. Using parallel compares, the computation of the proper predicate for this example expression takes only four instructions and one clock after we know the values of A, B, C, and D, much better performance than we could expect without these instructions.
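The "writes True or does nothing" behavior is what makes it safe for several compares to target the same predicate at once. The following Python sketch models the OR-type version; cmp_or and any_greater are hypothetical names, and the assumption that the fourth instruction is an initialization of the predicate to False is an interpretation, not something the text states.

```python
def cmp_or(pred, condition):
    """OR-type parallel compare: write True if the condition holds, else do nothing."""
    return True if condition else pred


def any_greater(a, b, c, d, order=(0, 1, 2)):
    """Evaluate (a > b) OR (b > c) OR (c > d) with three OR-type compares."""
    conditions = (a > b, b > c, c > d)
    pred = False  # assumed: one instruction clears the predicate beforehand
    # Each compare either writes True or leaves the predicate alone, so the
    # order in which the three writes land cannot change the final value.
    for index in order:
        pred = cmp_or(pred, conditions[index])
    return pred
```

Because no compare ever writes False, the three writes commute, which is the property that lets the hardware accept them in the same clock.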
Rotating Register File

Another frequently used program structure that can adversely affect performance is the loop. Itanium architecture provides a number of programming constructs that permit loops to be performed more efficiently. Combining these with the parallelism naturally leads one to do software pipelining in this architecture. Software pipelining is a technique that re-codes loops to execute instructions from multiple iterations in parallel with each other.

Software pipelining normally leads to the need for extra special-purpose code in front of a loop and after the loop. This code, called the prologue and epilogue, deals with priming the pipeline and draining the pipeline, respectively. In other architectures, this additional code has made software pipelining profitable only if the loop is known to run long enough to make up for the overhead. This estimation of time turns out to be impractical for a compiler to determine. Generally, compilers only pipeline loops involving floating-point operations since these tend to loop many times and the compiler often knows the number of times. The number of iterations to complete a loop is called the trip count. Software pipelining is seldom applied to integer loops because the overhead of having extra code is often a factor in slowing a program due to greater use of the instruction cache. Software pipelining is also often not applied to loops with non-constant trip counts because the prologue and epilogue code is more complex to create.

Itanium architecture uses an elegant solution called rotating registers to allow very compact representation of software-pipelined loops. With Itanium architecture, a compiler can produce software-pipelined loops without having to generate prologue or epilogue code. The resulting code is simpler and takes up less space. This simplification makes it more likely that software pipelining can speed up code. For this reason, the compiler can be more aggressive about using this technique without the penalties you would expect with other architectures.
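How rotation removes the need for separate prologue and epilogue code can be illustrated with a toy model. The sketch below assumes a two-stage pipeline (load, then double the value), a four-entry rotating file, and stage predicates that rotate along with the data. RotatingFile and pipelined_double are hypothetical names, and the real loop branch instructions and loop counters are not modeled.

```python
class RotatingFile:
    """Registers addressed through a rotating base."""

    def __init__(self, size):
        self.phys = [None] * size
        self.base = 0

    def _index(self, logical):
        return (logical + self.base) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # After rotation, the value written as logical n is read as logical n + 1.
        self.base = (self.base - 1) % len(self.phys)


def pipelined_double(src):
    """Run one kernel body for every stage; predicates switch stages on and off."""
    regs, preds = RotatingFile(4), RotatingFile(4)
    stages, out = 2, []
    for i in range(len(src) + stages - 1):
        preds.write(0, i < len(src))      # stage 1 is on while input remains
        if preds.read(0):                 # stage 1: load element i
            regs.write(0, src[i])
        if preds.read(1):                 # stage 2: use last iteration's load
            out.append(regs.read(1) * 2)
        regs.rotate()
        preds.rotate()
    return out
```

The same loop body serves as prologue (stage 2 predicated off in the first pass), kernel, and epilogue (stage 1 predicated off in the last pass), which is the compactness the text describes.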
Chapter 2: Data, Code, and Memory In the last chapter, you were introduced to the EPIC and its implementation in the Intel Itanium processor—the Itanium architecture. The execution environment of this 64-bit architecture consists of the data types, register resources, and memory address space used during the execution of an application program. These elements define the execution state of an application program. The application environment is similar to the execution environment, but it includes only resources that the application programmer can access directly through software. This chapter examines data and code and how they are organized in memory.
DATA TYPES, SIZES, AND FORMATS

Four data types are used to express information, such as integer data and addresses, in the Itanium architecture. Table 2.1 shows the directly supported data types: integer, ordinal, pointer, and immediate. These data types differ according to their use within the architecture. For example, the integer data type is used to express the value of data in a register or storage location in memory, while the immediate data type is used for coding a value directly into an instruction as immediate data. On the other hand, the value of a pointer is an address that identifies a storage location in memory. Floating-point data formats are covered in a separate chapter.

Table 2.1: Data Types and Sizes

Type/Size    Byte    Word    Double word    Quad word
Integer      √       √       √              √
Ordinal      √       √       √              √
Pointer      -       -       √              √
Immediate    √       √       √              √
The data types are supported in four standard sizes: byte, word, double word, and quad word. Their lengths are 8 bits, 16 bits, 32 bits, and 64 bits, respectively. Bits within a unit of data, such as a word or quad word, are always numbered from 0, starting with the least-significant bit. Therefore, the most significant bit of a double word is in bit position 31. Notice in Table 2.1 that not all data types use all of these sizes. For instance, pointers are only expressed as double words or quad words. In fact, native pointers are 64 bits long; 32-bit pointers are only used by IA-32 compatible application software.

Table 2.2: Range for Unsigned and Signed Integer Numbers

Type/Size      Byte    Word       Double word       Quad word
Unsigned Min   0       0          0                 0
Unsigned Max   255     65,535     4,294,967,295     18,446,744,073,709,551,615
Signed Min     -128    -32,768    -2,147,483,648    -9,223,372,036,854,775,808
Signed Max     +127    +32,767    +2,147,483,647    +9,223,372,036,854,775,807
The values used as integer and immediate data can be either unsigned or signed numbers. For the unsigned version of a byte-wide integer number, all 8 bits are used to represent the value of the data. An example of an unsigned byte number is:

10010001₂ = 91H = 145₁₀

The ranges of unsigned decimal numbers corresponding to byte, word, double word, and quad word size data are shown in Table 2.2. The largest unsigned number that can be coded as a quad word of data is calculated as:

2^64 - 1 = 18,446,744,073,709,551,615₁₀

The general formats for the byte, word, and double word signed integer numbers are shown in Figure 2.1. Notice that the most significant bit is the sign field, which holds the sign bit that identifies whether the number is positive or negative. For instance, in the byte-wide signed integer, bit 7 is the sign bit. The seven least significant bits hold the value of the data. If the sign field is 0, the number is positive; 1 stands for negative. For example, the word-wide signed number 0100000000001111₂ can be expressed in hexadecimal form as 400FH and is equivalent to decimal number +16,399.
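The ranges in Table 2.2 follow directly from the bit widths. As a quick check, the limits can be computed with a short Python sketch; the function names are hypothetical and chosen only for this illustration.

```python
SIZES_IN_BITS = {"byte": 8, "word": 16, "double word": 32, "quad word": 64}


def unsigned_range(bits):
    """Smallest and largest unsigned value that fits in the given width."""
    return 0, (1 << bits) - 1


def signed_range(bits):
    """Smallest and largest two's complement value for the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def print_ranges():
    """Print one line per size with its unsigned and signed limits."""
    for name, bits in SIZES_IN_BITS.items():
        print(name, unsigned_range(bits), signed_range(bits))
```

Calling print_ranges() lists the same limits that the table gives for each of the four sizes.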
Figure 2.1: Integer Number Formats

Negative numbers are expressed in 2’s complement notation. To form the 2’s complement of a number, invert the logic value of each of its bits and then add 1. For example, the negative number –8 is not coded as 10001000₂; instead, it is 11111000₂, which equals F8H. Table 2.2 shows that the most negative 16-bit number is decimal –32,768₁₀. Expressing it in 2’s complement notation gives:

1000000000000000₂ = 8000H

Table 2.2 shows the range of all positive and negative signed integer numbers.

During the execution of a program by the Itanium processor, integer numbers are frequently zero-extended or sign-extended. For instance, when an instruction is executed to load a byte of data from memory to an internal register, the value is automatically zero-extended to 64 bits as part of the load operation. Zero-extension means that the higher order bits are simply filled with 0s. For example, a load instruction can read the byte of data FFH from memory, but the value is extended to 00000000000000FFH as it is loaded into the register.

Signed numbers are treated differently. An example of a typical use of a signed integer is in arithmetic instructions. Immediate operands in addition and subtraction instructions are signed numbers, but are not 64 bits in length. When an immediate operand is added to or subtracted from a value in a register, the immediate value is sign-extended before the arithmetic operation is performed. A signed integer number is sign-extended by filling the more significant bits with the value of the sign bit. For instance, if the 14-bit immediate operand in an add instruction is –1 (3FFFH), the value is sign-extended to FFFFFFFFFFFFFFFFH for use in the addition computation.
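The 2’s complement, zero-extension, and sign-extension rules can be worked through with a small Python sketch. The helper names are hypothetical; the arithmetic itself is standard bit manipulation.

```python
def twos_complement(value, bits):
    """Encode a signed integer as a bit pattern of the given width."""
    return value & ((1 << bits) - 1)


def zero_extend(pattern, from_bits):
    """Keep the low from_bits bits; every higher-order bit becomes 0."""
    return pattern & ((1 << from_bits) - 1)


def sign_extend(pattern, from_bits, to_bits=64):
    """Copy the sign bit of a from_bits-wide pattern into the higher-order bits."""
    pattern &= (1 << from_bits) - 1
    if pattern >> (from_bits - 1):  # sign bit is 1: fill the upper bits with 1s
        pattern |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return pattern
```

For example, sign_extend applied to a 14-bit pattern of all 1s fills bits 14 through 63 with 1s, while zero_extend of the byte FFH leaves the upper 56 bits clear.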
APPLICATION MEMORY ADDRESSING MODEL

Up to this point in the chapter, you have seen the data types available in the Itanium processor architecture. The pointer was identified as the type of data that is used to address memory, and we noted that the standard length of addresses is 64 bits. However, 32-bit address pointers are also supported by the Itanium architecture for use with IA-32 architecture compatible code. Let us begin our study of the memory resource of the Itanium processor architecture with the application memory-addressing model.

To ensure that the memory requirements of very large applications, such as data warehousing and e-Business, can be effectively supported, the Itanium processor architecture implements a very large virtual memory address space. Since addresses are 64-bit unsigned pointers, the virtual address space corresponds to 2^64 (18,446,744,073,709,551,616) byte-wide memory locations over the address range from 0000000000000000H through FFFFFFFFFFFFFFFFH. This address space is said to be virtual because the registers used for addressing within the application register model are 64 bits wide, but the implementation in the Itanium processor does not physically support all of these bits with address lines in hardware. The 44 address lines of the Itanium processor provide 2^44 (17,592,186,044,416) unique physical addresses for implementing external hardware resources.
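The sizes of the two address spaces follow from the number of address bits. A brief Python check, using hypothetical constant names, is:

```python
VIRTUAL_ADDRESS_BITS = 64   # width of an address pointer
PHYSICAL_ADDRESS_BITS = 44  # address lines on the Itanium processor, per the text


def address_space_size(bits):
    """Number of distinct byte addresses reachable with the given number of bits."""
    return 1 << bits


def highest_address(bits):
    """Largest address in the space, as an integer."""
    return address_space_size(bits) - 1
```

Formatting highest_address(VIRTUAL_ADDRESS_BITS) in hexadecimal gives sixteen F digits, the top of the virtual range quoted above.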
DATA AND CODE ORGANIZATION AND ALIGNMENT

Even though the memory address space is byte-addressable, the byte is not the data size normally used for data accesses. Information in memory can be accessed in units of 1, 2, 4, 8, and 16 bytes. These units are named the byte, word, double word, quad word, and bundle, respectively. Although loads and stores are provided for the variously sized units of data, the basic data type is the quad word (8 bytes, or 64 bits). Fetches of instruction code from memory are always performed in bundles (16-byte units).
Figure 2.2: Application Memory Address Space For better performance, all addressable information should be stored in memory on their naturally aligned boundaries. Hardware or operating system software may have support for processing unaligned data, but accesses of information organized in this way may result in a loss of performance. Examples of aligned data for each size data unit are identified in Figure 2.2. For example, to be aligned, quad words must start at addresses 0H, 8H, 10H, 18H, and on up through FFFFFFFFFFFFFFF8H. Since instruction code is read as bundles, each bundle must be aligned in memory on natural 16-byte boundaries—0H, 10H, 20H, and so on. How many bits are in an instruction bundle? All instruction bundles are 16 bytes long; therefore, they contain 128 bits of information.
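Natural alignment means that the address of a unit is a multiple of its size. The Python sketch below, with hypothetical helper names, tests addresses against that rule for the five unit sizes named in this section.

```python
UNIT_SIZES = {"byte": 1, "word": 2, "double word": 4, "quad word": 8, "bundle": 16}


def is_aligned(address, size):
    """True if the address lies on a natural boundary for a unit of this size."""
    return address % size == 0


def align_up(address, size):
    """Round an address up to the next natural boundary (size is a power of two)."""
    return (address + size - 1) & ~(size - 1)
```

For instance, address 18H is a valid quad word boundary, while 14H is aligned for a double word but not for a quad word.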
BYTE ORDERING

Although Itanium architecture uses little-endian memory byte order by default, the byte ordering for data can be switched to big-endian under software control. The big-endian (be) bit in the user mask (UM) register controls whether loads and stores of data use little-endian or big-endian byte ordering. When this bit is 0, little-endian byte ordering is selected. Switching it to 1 changes the byte ordering of data to big-endian. The be bit does not affect the organization of instruction code in memory or the order in which code is fetched from memory. Code is always held in memory using little-endian organization.

When using little-endian data format, the least significant byte within a multibyte unit of data is held in the lowest address storage location in memory and the most significant byte in the highest location. Figure 2.3(a) shows a quad word (8 bytes) of data arranged in memory using little-endian byte ordering. This quad word has the value 0F0E0D0C0B0A0908H. Notice that its least significant byte, which is 08H, is held at the lowest address storage location, which is address 0H. Moreover, the most significant of the eight bytes, which equals 0FH, is stored in the storage location with the highest address, 7H.

When little-endian byte ordered data are loaded into a register from memory, they are stored with the lowest addressed byte right justified in the register. That is, the lowest-addressed byte of the unit of data in memory is placed in the lowest-order byte location in the register. Moreover, the highest-addressed byte of this element of data in memory goes into the highest-order byte location in the register.

Figure 2.3(b) shows how the variously sized elements of data from a little-endian ordered quad word in memory get loaded into a register. Notice that the lowest of the four examples is denoted as LD8[0]. This notation stands for "load 8 bytes of data starting from memory address 0H into the register." That is, the complete quad word is loaded into the register. Therefore, the value 08H, which is the least significant byte of the quad word in memory, is placed in the least significant byte location of the register, and the most significant byte of the quad word, 0FH, ends up in the most significant byte location.
Figure 2.3: Little-endian Data

The uppermost register configuration in Figure 2.3(b) is identified as LD1[1]. This means "load one byte starting at address 1H in memory into the register." In this case, the value 09H from memory location 1H is copied into the least significant byte location of the register and the remaining bytes are zeroed. This is an example of loading a single byte of little-endian ordered data from memory into a register.

In Figure 2.3(b), what size data does the example LD4[4] process? Where does this piece of data start in memory and what is its value? This instruction specifies four bytes of data with the first of the four bytes located at address 4H. From Figure 2.3(a), you can see that the value of this double word is 0F0E0D0CH.

If software changes the be bit in UM to 1, big-endian ordering is selected. Using big-endian byte ordering, the most significant byte of a unit of data is stored in memory at the lowest address and its least significant byte at the highest address. For this reason, the relationship between addressed bytes in memory and their corresponding location in a register is changed. Now the lower-addressed byte of a data unit in memory corresponds to the higher-order byte in the register. That is, the most significant byte of data in the register corresponds to the byte of the data unit at the lowest address byte location in memory.

Figure 2.4(a) represents big-endian ordered information in memory. Even though the individual bytes are stored in the opposite order in memory, the value of the quad word that they represent is the same. From the bottom example in Figure 2.4(b), we see that loading the complete quad word starting at address 0H in memory again results in the value 0F0E0D0C0B0A0908H in the register.
Figure 2.4: Big-endian Data

How would the quad word 08090A0B0C0D0E0FH be stored in memory starting at address 8H? The value 08H from the most significant byte location of the quad word is held at memory address 8H. This value is followed by the less significant bytes of the quad word in contiguous memory locations, ending with the least significant byte, 0FH, at memory address FH (decimal 15).
You can use Figures 2.4(a) and (b) to explain the relationship between memory and register for the LD2[2] example. In this denotation, the word of data starting at address 2H in memory is to be loaded into the register. Since big-endian byte ordering is in use, the most significant byte of the word is at address 2H. Therefore, the value 0D0CH is loaded right-justified into the register.
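The load examples in this section can be reproduced with Python's built-in int.from_bytes. In this sketch the byte strings stand in for the memory contents of Figures 2.3(a) and 2.4(a), and the load helper with its big_endian flag (a stand-in for the be bit) is a hypothetical name.

```python
# Memory addresses 0H through 7H, lowest address first.
LITTLE_ENDIAN_MEMORY = bytes([0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F])
BIG_ENDIAN_MEMORY = bytes(reversed(LITTLE_ENDIAN_MEMORY))


def load(memory, address, size, big_endian=False):
    """Read size bytes at address and return them right-justified and zero-extended."""
    data = memory[address:address + size]
    return int.from_bytes(data, "big" if big_endian else "little")


def store_big_endian(value, size):
    """Return the bytes of value in the order big-endian ordering places them."""
    return value.to_bytes(size, "big")
```

With these definitions, load(LITTLE_ENDIAN_MEMORY, 4, 4) corresponds to LD4[4] and load(BIG_ENDIAN_MEMORY, 2, 2, big_endian=True) corresponds to the big-endian LD2[2] example.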
Chapter 3: Register Resources Chapter 2 examined the Itanium processor’s memory address space, data types, and the way that data and code are organized and aligned in memory. This chapter introduces additional architectural resources—the registers of the execution and application environments. The function and capabilities of each of the register files are covered as well as their role in application programming.
APPLICATION REGISTER SET AND APPLICATION REGISTER STATE

The Itanium architecture provides a large set of registers for programmers to use in their applications. With this large number of internal registers available, several computations can be performed at the same time without having to frequently store (write) and load (read) intermediate results to or from memory. This is an important asset of the architecture and contributes to the high level of software performance it achieves.

Figure 3.1 shows the processor’s register files. They include 128 general registers, the instruction pointer, 128 floating-point registers, 64 predicate registers, 8 branch registers, a current frame marker register, a user mask register, processor identifier registers, performance monitoring data registers, and up to 128 special purpose registers called the application registers. This chapter examines all but the floating-point registers. Chapter 9 covers the floating-point resources of the application architecture.
Figure 3.1: Application Register Set The application register state of the Itanium architecture represents the current contents of the registers that are affected by the execution of an application. The values in these registers can be saved, read, or modified by the programmer using instructions. The application register state does not include processor identifier registers and performance monitoring data registers, because they contain fixed information and track information about the execution of the application, respectively. That is, their content does not result from executing the application program. However, the application program can read their contents.
INSTRUCTION POINTER

The instruction pointer (IP) register is used to access instruction code in the program memory. In the Itanium architecture, instruction code is always fetched from memory in a unit known as an instruction bundle. Therefore, IP holds the bundle address that points to the next bundle of instructions to be read from memory for execution. IP is 64 bits long; therefore, it can address up to 18,446,744,073,709,551,616 bytes of memory. A value cannot be directly written into IP, but its value is initialized at reset. To enter into the application code, a new address is loaded into IP with a branch operation.

Since instruction bundles are always 16 bytes in length and are stored in memory at 16-byte aligned address boundaries, the least significant four bits of IP are always equal to zero. The address in IP is incremented as instruction bundles are fetched for execution: IP is incremented by 16₁₀ (10H) each time an instruction bundle is read from memory. For example, if an instruction bundle begins at address 0000000000000000H, the next bundle starts at address 0000000000000010H. The value in IP can be read directly by software.
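The bundle-by-bundle advance of IP can be sketched as simple address arithmetic. The function name is hypothetical, and the wrap-around at the top of the 64-bit space is an assumption of this model rather than something the text specifies.

```python
BUNDLE_BYTES = 16
ADDRESS_MASK = (1 << 64) - 1


def next_bundle_address(ip):
    """Return the address of the bundle that sequentially follows the one at ip."""
    if ip & 0xF:
        raise ValueError("IP must be 16-byte aligned: low four bits are zero")
    return (ip + BUNDLE_BYTES) & ADDRESS_MASK  # wrap-around is assumed here
```

Starting from address 0, three sequential fetches step through 0H, 10H, and 20H, and the low four bits of the result stay zero throughout.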
GENERAL REGISTERS The application register set provides 128 64-bit general registers. Together, they form a general purpose, 64-bit register file that provides the central temporary storage resource for information used in integer computation and for movement to other registers. That is, the general registers are used to hold the value of the source and destination operands as they are processed by instructions of the program. They are also used to hold address pointers that select the memory location to be accessed during load (memory read) and store (memory write) operations. Programs at all privilege levels can access information in the general-purpose registers. The general register set is illustrated in Figure 3.2(a). Notice that the registers are identified as Gr0 through Gr127. As shown, each general register has 64 bits of normal data storage plus an additional bit called the Not a Thing (NaT) bit.
Figure 3.2: General Register Set
Operands and address information are held in bits B0 through B63 of the general register. Here B0 is the least significant bit and B63 the most significant bit. Data held in the general registers can be 8 bits, 16 bits, 32 bits, or 64 bits in length. Some instructions also have the ability to examine and modify an individual bit or group of bit locations in a general purpose register.

A quantity loaded from memory into a general register is always placed in the least-significant portion of the register. That is, the value is placed right-justified in the target general register. Figure 3.2(b) shows how each size of data is arranged in the register. For example, a byte of data is held in bit locations B7 through B0, while a word of data is held in bits B15 through B0. Address information is normally 64 bits in length; therefore, it takes up the complete general register. Loads of 1, 2, or 4 bytes of data are zero-extended to 64 bits prior to being written into their target registers. This can be particularly important when dealing with signed integers and 32-bit addresses, or any immediate operand that is shorter than 64 bits.

The double word integer number F124ABCDH is loaded from memory into register Gr1. How are the bits of data organized in the register? The most significant bit, which equals 1, is held in bit B31 of Gr1. The rest of the bits, 1110001001001001010101111001101₂, are held in B30–B0. All unused more-significant bits are filled with zeros.

Now, consider the significance of the Not a Thing bit that accompanies each register. NaT is a key element of the speculation capability implemented in the Itanium architecture. Remember that speculation is the process that enables scheduling and execution of instructions that load data from memory earlier in the instruction stream. If the execution of a speculative load instruction results in a fault exception, the exception is not raised at that time; instead, it is deferred.
Setting the NaT bit associated with the general register used as the destination operand marks the occurrence of a deferred exception. This condition is tested by other instructions to determine whether the value stored in the register is valid. As shown in Figure 3.3(a), the general registers are partitioned into two subsets, based on their intended use—the static general registers and the stacked general registers. Gr0 through Gr31 are named the static general registers. These 32 registers are visible to all software and are to be used for general temporary storage of data and address information processed during execution of the instructions.
Figure 3.3: Static and Stacked General Registers Gr0 is special in that it is read-only and always contains a value of 0. This register can be used for clearing of other registers and it enables an efficient method of implementing the move instruction. An attempt to write to Gr0 causes a fault exception. General registers Gr32 through Gr127 represent the stacked general registers. The stacked registers are made available to a software function by allocation of a register stack frame. Using these registers, the function defines a register stack frame that begins at Gr32 and that can be defined to include from 0 to 96 registers. The unused registers are not visible to the function. The stack frame is further segmented into two groups, called local registers and output registers. Figure 3.3(b) shows the organization of a typical stack frame. Upon creation, all of the registers are assigned to the output group. However, the size of the local and output groups can be adjusted to meet the needs of the function under software control. Once registers are allocated to the stack frame, their use is local to the function for which
they were created, so they cannot be accessed by any other part of the program. When the function is complete, the registers in the stack frame are deallocated and available for use by another function. This access restriction does not mean that the stacked general registers are fully dedicated to a routine and cannot be used by other software. In fact, they are shared by many routines but are used by only one routine at a time. Let us assume that another function is called from within the current function. This initiates a switch in program control from the original function to the new function. As part of this process, execution of the original function is stopped and the program state is saved to permit a return to the original function. A new stack frame is created for the function, then the new software routine is started. The phrase “saving the program state” means that the content of the register stack frame for the original function is saved. This saved information permits backward linkage to resume the original function as the new function runs to completion. This switching of state frees up the stacked general registers for reuse and another register stack frame is created, with the size defined by the new function. At completion of the new function, its register stack frame is deallocated, then the content of the original register stack is automatically restored, permitting the original function to resume execution. When nested functions occur in a program, a new register stack frame is created for each new nesting level. In the Itanium architecture, the program stack frame switching operation occurs automatically. The function call and return mechanism does not automatically save or restore the contents of the static general registers. If the program needs to preserve the information in them, software must be included at the function boundaries to save and restore them. Chapter 7 covers function calling in more detail. 
Any number of the stacked registers can be set up to operate as rotating registers. This register mode supports a methodology used to improve the performance of loop operations. The loop is an important software structure widely used in application programs.
PREDICATE REGISTERS The next set of registers is an important element of the IA-64 architecture’s predication mechanism. For this reason they are called the predicate registers. In Chapter 1, you were introduced to predication, the process that the Itanium architecture uses to make the execution of an instruction conditional. As shown in Figure 3.4, the architecture provides 64 one-bit-wide predicate registers in the application register set and they are labeled Pr0 through Pr63. Each predicate register holds what is called a predicate, which is set to 1 or reset to 0 based on the results of executing a compare instruction. If a bit such as Pr1 is set to 1, it means True; logic 0 in the register stands for False.
Figure 3.4: Organization of the Predicate Registers In application programs, the state of a predicate register is called a qualifying predicate and its value determines how the processor executes an instruction. Execution of most instructions can be gated by a qualifying predicate. If the predicate is True, the instruction executes normally; if the predicate is False, the instruction executes, but does not modify the application register state. The branch instructions used for loop support are examples of instructions that cannot be predicated. For example, suppose that the predicates of a compare instruction that determines if A > B are held in registers Pr1 and Pr2. The value in Pr1 corresponds to the condition that A is greater than B, and the value in Pr2 represents the result that A is less than or equal to B. If the values of A and B are –15 and +10, respectively, when the compare instruction is executed, what states are produced in the predicate registers? The conditional test made with the compare instruction is –15 > +10 The conditional relation is not satisfied. Therefore, Pr1 is made 0 and Pr2 is set to 1. From Figure 3.4, you can see that the predicate registers are divided into two groups, the static predicate registers and the rotating predicate registers. Predicate registers Pr0 through Pr15 are the static predicate registers, and they are used in conditional instruction execution and branching applications. The other 48 predicate registers, Pr16 through Pr63, form the rotating predicate registers group. Similar to general register Gr0, the first predicate register, Pr0, is read-only and contains a fixed value. Since this predicate is always read as 1 for True, it can be assigned to instructions to make their execution unconditional. Examples are unconditional branch, call, and return operations. If an instruction is executed that writes to Pr0, the resulting value is discarded. 
Like the rotating registers implemented in the stacked general registers, rotating predicate registers support your efforts while tuning the performance of loop operations. In this case, the values in the registers are known as stage predicates because they are used to identify the stages of a loop. The last section in Chapter 7 covers software pipelined loops.
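The A > B compare example above can be sketched in a few lines of Python. This is only an illustrative model, not Itanium code: the function names compare_gt and predicated_add are hypothetical, and the model assumes a compare that writes a predicate and its complement, as the example describes.

```python
def compare_gt(a, b):
    """Model a compare that writes a predicate pair for the test a > b.

    Returns (pr1, pr2): pr1 is 1 when a > b, and pr2 is its complement.
    """
    pr1 = 1 if a > b else 0
    pr2 = 1 - pr1
    return pr1, pr2


def predicated_add(qp, dest, src1, src2):
    """Return the destination value after an add gated by predicate qp.

    When qp is 0 (False), the destination keeps its old value.
    """
    return src1 + src2 if qp == 1 else dest


# The example from the text: A = -15, B = +10
pr1, pr2 = compare_gt(-15, 10)
print("Pr1 =", pr1, "Pr2 =", pr2)
```

With these inputs the condition is not satisfied, so an instruction qualified by Pr1 would leave its destination unchanged, while one qualified by Pr2 would update it.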
BRANCH REGISTERS The branch is another software operation that is frequently used in application programs. It initiates a change in the sequence in which instructions are executed, and is initiated in the Itanium architecture with a branch instruction. When a branch occurs, program control flow is transferred to an instruction at a new address. Thus, the value in the instruction pointer register must be changed, so branch instructions have the ability to load a new value of address into the IP.
Figure 3.5: Branch Registers The application register set includes 8 64-bit-wide branch registers that support the processor’s branching mechanism. As shown in Figure 3.5, they are called Br0 through Br7. When an indirect branch is initiated—that is, a branch in which the start address is specified indirectly as the content of a register—the value of the new address is loaded into IP from a branch register. That is, branch registers are used to hold the target address where program execution is to resume. The target address used for a branch operation must be bundle aligned. What does this mean relative to the location of the first instruction of the software routine being branched to? First, for the target address to be bundle-aligned, it must be on a 16-byte boundary.
The four least significant bits of the branch-to address held in a branch register are “don’t care” states. Moreover, the instruction where the new program segment begins must be the first instruction in the bundle.
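The alignment rule can be illustrated with a short Python sketch. The helper names are hypothetical, and the sketch simply assumes that the processor treats the four low-order bits of the branch register value as zero when it forms the new instruction pointer.

```python
BUNDLE_SIZE = 16            # bytes per instruction bundle
MASK_64 = (1 << 64) - 1     # keep results within 64 bits


def bundle_target(branch_register_value):
    """Return the bundle-aligned address selected by a branch register.

    The four least significant bits are "don't care", so they are cleared.
    """
    return branch_register_value & ~(BUNDLE_SIZE - 1) & MASK_64


def is_bundle_aligned(address):
    """True when the address lies on a 16-byte boundary."""
    return address % BUNDLE_SIZE == 0


target = bundle_target(0x4000000000001237)
print(hex(target), is_bundle_aligned(target))
```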
CURRENT FRAME MARKER REGISTER
Whenever a function is called, a register stack frame, which is uniquely defined for the routine, is created from the stacked general registers. Each register stack frame has a frame marker associated with it. The values in the frame marker define the state of the general register stack that exists whenever the function is performed. When a function is running, its frame marker is held in the current frame marker (CFM) register. The layout of the frame marker is shown in Figure 3.6. The information held in the frame marker describes the sizes of the various portions of the stack frame and identifies three register rename base values. Table 3.1 lists the frame marker’s fields of information.
Notice that the size of the stack frame is described by three parameters. They are size of stack frame (SOF), size of local portion (SOL), and size of rotating portion (SOR). SOF is held in the seven least significant bits. Seven bits permit a range from 0 to 127, but the stacked registers of the IA-64 register set limit the size of the stack frame to a maximum of 96 registers. The next seven bits, SOL, identify how many of the registers in the stack frame are assigned to the local register group. In this way, we see that any or all of the registers in the stack frame can be made local registers. The size of the output area (SOO) is found by simply subtracting the value SOL from that of SOF.
Figure 3.6: Frame Marker Format

Table 3.1: Frame Marker Field Descriptions

Field     Bit range   Description
sof       6:0         Size of stack frame
sol       13:7        Size of local portion of stack frame
sor       17:14       Size of rotating portion of stack frame; the number of rotating registers is 8 times SOR
rrb.gr    24:18       Register rename base for general registers
rrb.fr    31:25       Register rename base for floating-point registers
rrb.pr    37:32       Register rename base for predicate registers
Example
Assume that a register stack frame is created with size SOF = 48. If the size of the local register group is SOL = 16, how many output registers (SOO) are there? Which registers are in the local and output register groups?

Solution
The number of output registers is found as:
SOO = SOF – SOL = 48 – 16 = 32
Since the stacked register group starts at register Gr32, the 48-register frame spans Gr32 through Gr79, and the local register and output register groups consist of:
Local register group = Gr32 – Gr47
Output register group = Gr48 – Gr79
The number of rotating registers in the stack frame is set by the value of SOR, which is located in bits 14 through 17 of the frame marker. For example, if SOR equals 0111₂, the number of rotating registers is
SOR x 8 = 7 x 8 = 56
The content of the CFM register cannot be directly read or written by an application program. Instead, instructions are available to load it with the different elements of the frame marker. How does it get loaded with the appropriate frame marker information? Actually, this process has several steps. On the call of a function, a frame marker is automatically written to the CFM that creates a new stack frame with no local or rotating registers (SOL = 0 and SOR = 0), but with the size of the stack frame (SOF) equal to the size of output (SOO) of the calling function. Now the application program can reset the values of SOL and SOR.
A program can configure part of the stacked register group as rotating registers. Another part of this process requires renaming of the bases of the rotating registers. As shown in Table 3.1, this information is also contained in the frame marker held in the CFM register. The register-renaming base fields are all reset to 0 when a new frame marker is loaded into the CFM, and they can be updated under software control.
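The field layout in Table 3.1 and the frame arithmetic described in this section can be sketched in Python. This is a minimal model for study purposes only; the function names are hypothetical, and the sketch does not model the instructions that actually update the CFM.

```python
FIRST_STACKED = 32  # stacked general registers begin at Gr32


def decode_frame_marker(cfm):
    """Split a frame marker value into the fields listed in Table 3.1."""
    sof = cfm & 0x7F            # bits 6:0
    sol = (cfm >> 7) & 0x7F     # bits 13:7
    sor = (cfm >> 14) & 0xF     # bits 17:14
    return {
        "sof": sof,
        "sol": sol,
        "sor": sor,
        "rrb_gr": (cfm >> 18) & 0x7F,   # bits 24:18
        "rrb_fr": (cfm >> 25) & 0x7F,   # bits 31:25
        "rrb_pr": (cfm >> 32) & 0x3F,   # bits 37:32
        "soo": sof - sol,               # size of output area
        "rotating": sor * 8,            # number of rotating registers
    }


def frame_groups(sof, sol):
    """Return (first, last) register numbers of the local and output groups."""
    local = (FIRST_STACKED, FIRST_STACKED + sol - 1) if sol else None
    output = (FIRST_STACKED + sol, FIRST_STACKED + sof - 1) if sof > sol else None
    return local, output


def frame_marker_on_call(caller_cfm):
    """Frame marker given to a called function: SOF = caller's SOO,
    with SOL, SOR, and the rename bases all 0."""
    return decode_frame_marker(caller_cfm)["soo"]


caller = 48 | (16 << 7)                  # SOF = 48, SOL = 16
print(decode_frame_marker(caller)["soo"])
print(frame_groups(48, 16))
print(decode_frame_marker(frame_marker_on_call(caller)))
```

For the example frame, the model reports 32 output registers, and the frame handed to a called function has SOF = 32 with no local or rotating registers.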
APPLICATION REGISTERS The application registers are a file of special-purpose data and control registers. They provide processor functions for both the IA-32 and Itanium processor instruction sets and can be accessed by software applications.
The application register file contains 128 64-bit registers. Table 3.2 is a list of these registers. Notice that the current Itanium processor uses a large group of these registers for support of compatibility with the IA-32 software architecture. The function of these registers will be explained in Chapter 10, which covers the compatibility of IA-32 software with the Itanium architecture. Another register, Ar40, provides support for the floating-point unit. It is explained as part of the floating-point operations in Chapter 9.

Table 3.2: Description of the Application Registers

Register        Name        Description
Ar0–Ar7         Kr0–Kr7     Kernel registers 0 through 7
Ar8–Ar15                    Reserved
Ar16            RSC         Register stack configuration register
Ar17            BSP         Backing store pointer
Ar18            BSPSTORE    Backing store pointer for memory stores
Ar19            RNAT        RSE NaT collection register
Ar20                        Reserved
Ar21            FCR         IA-32 compatibility
Ar22–Ar23                   Reserved
Ar24            EFLAG       IA-32 compatibility
Ar25            CSD         IA-32 compatibility
Ar26            SSD         IA-32 compatibility
Ar27            CFLG        IA-32 compatibility
Ar28            FSR         IA-32 compatibility
Ar29            FIR         IA-32 compatibility
Ar30            FDR         IA-32 compatibility
Ar31                        Reserved
Ar32            CCV         Compare and exchange compare value register
Ar33–Ar35                   Reserved
Ar36            UNAT        User NaT collection register
Ar37–Ar39                   Reserved
Ar40            FPSR        Floating-point status register
Ar41–Ar43                   Reserved
Ar44            ITC         Interval time counter
Ar45–Ar47                   Reserved
Ar48–Ar63                   Ignored
Ar64            PFS         Previous function state
Ar65            LC          Loop count register
Ar66            EC          Epilog count register
Ar67–Ar111                  Reserved
Ar112–Ar127                 Ignored
In Table 3.2, you see that many of the application registers are not in use in the current revision of the Itanium architecture. Unused registers are either reserved or ignored. The reserved registers are to be used for architectural extensions in future Itanium processors. For example, registers Ar22 and Ar23 are marked as reserved. Any attempt to access a reserved register causes an illegal operation fault.
The two ignored groups of 16 registers, Ar48 through Ar63 and Ar112 through Ar127, are saved for backward-compatible extensions to be supported in future Itanium processors. A read of an ignored register returns a value equal to 0. Software may write any value to an ignored register and the hardware takes no action on the value written. The remaining application registers support application programming.
Kernel Registers
Looking at Table 3.2, you can see that the first group of application registers that provide processor functions is Ar0 through Ar7. The eight kernel registers, labeled Kr0 through Kr7, are used to convey information from the operating system to the application program. For instance, the operating system can put read-only, semi-static information such as the processor ID into one of the kernel registers.
Register Stack Engine Support Registers
The IA-64 architecture provides a group of registers to support the register stack engine (RSE). The RSE is a hardware mechanism that manages the stacked general registers. During nested function calls, it automatically saves and restores the register stack frame. That is, it moves the content of the stack frame between the general register file and memory.
The register stack configuration (RSC) register, which is in application register position Ar16, is a 64-bit register that is used to control the operation of the RSE. The format of the RSC is illustrated in Figure 3.7 and its bit fields are described in Table 3.3. Notice that fields are provided to set the mode of operation; assign a privilege level to RSE operation; select an organization for the RSE data transfers; and synchronize loading of the RSE. For instance, the code in bits 0 and 1, which represents mode, is used to select one of four modes of operation for the RSE. The modes differ in how aggressively the RSE pursues saves and restores of stack register frames. If mode is made equal to 11₂, the RSE is set for "eager" mode. This means that it aggressively pursues both loads and saves. The reverse of this configuration is known as "enforce lazy" and is selected by making mode equal to 00₂.
Figure 3.7: RSC Register Format

Table 3.3: RSC Field Descriptions

Field     Bit range      Description
mode      1:0            RSE mode: controls how aggressively the RSE saves and restores register frames (bit 0: eager stores; bit 1: eager loads)
pl        3:2            RSE privilege level: loads and stores issued by the RSE are at this privilege level
be        4              RSE endian mode: loads and stores issued by the RSE use this byte ordering. Value of 0 is little-endian; value of 1 is big-endian
loadrs    29:16          RSE load distance to tear point: value used in the loadrs instruction for synchronizing the RSE to a tear point
rv        15:5, 63:30    Reserved

RSE Modes

Bit pattern   Bit 1: eager loads   Bit 0: eager stores   RSE mode
00            Disabled             Disabled              Enforce lazy
10            Enabled              Disabled              Load intensive
01            Disabled             Enabled               Store intensive
11            Enabled              Enabled               Eager
What is the configuration of the RSE if the value in the RSC register is 000000003FFF000AH?
Comparing this information to the format in Figure 3.7, the configuration is:

mode = 10 = load intensive
pl = 10 = privilege level 2
be = 0 = little-endian
loadrs = 11111111111111₂ = value used for loadrs instruction synchronization
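The same decoding can be checked mechanically. The Python sketch below extracts the fields of Table 3.3 from a 64-bit RSC value; the function names are hypothetical and the code is only a study aid, not a description of how software reads the register.

```python
RSE_MODES = {
    0b00: "enforce lazy",
    0b01: "store intensive",
    0b10: "load intensive",
    0b11: "eager",
}

# Reserved (rv) fields: bits 15:5 and bits 63:30
RESERVED_MASK = (((1 << 11) - 1) << 5) | (((1 << 34) - 1) << 30)


def decode_rsc(rsc):
    """Split an RSC value into the fields described in Table 3.3."""
    return {
        "mode": RSE_MODES[rsc & 0x3],       # bits 1:0
        "pl": (rsc >> 2) & 0x3,             # bits 3:2
        "be": (rsc >> 4) & 0x1,             # bit 4
        "loadrs": (rsc >> 16) & 0x3FFF,     # bits 29:16
    }


def has_reserved_bits(rsc):
    """True if any reserved bit is nonzero (software must write zeros)."""
    return (rsc & RESERVED_MASK) != 0


fields = decode_rsc(0x000000003FFF000A)
print(fields, has_reserved_bits(0x000000003FFF000A))
```

For the value used in the example, the sketch reports load intensive mode, privilege level 2, little-endian byte ordering, and a loadrs field of all ones.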
Looking at Figure 3.7, we see that two bit fields in the RSC register, bits 5 through 15 and bits 30 through 63, are identified as reserved (rv). They are examples of reserved fields and may have a use in future revisions of the Itanium architecture. For reserved fields, hardware will always return a value of 0 on a read. Software must always write zeros to these fields. Any attempt to write a nonzero value into a reserved field will raise a reserved register/field fault.
In our earlier description of the RSE, we pointed out that it implements a mechanism for automatically saving and restoring register stack frames from memory. The next application register we consider, Ar17, contains the address pointer that identifies the starting location at which the current stack frame is held in memory. That is, this is the location where the content of the first register of the stack frame (Gr32) would be saved. It is called the RSE backing store pointer (BSP) register.
The format of the BSP register is shown in Figure 3.8. Since the pointer held in BSP identifies the location of a 64-bit value in memory (eight contiguous bytes), the three least significant bits are all zero. BSP is a read-only register. An attempt to write into a read-only register results in an illegal operation fault.
Figure 3.8: BSP Register Format

Application register Ar18 also holds a 64-bit pointer address. However, this register, which is known as the RSE backing store pointer for memory stores (BSPSTORE) register, holds the address of the location in memory to which the RSE will spill the next stack frame. Figure 3.9 gives the format of BSPSTORE. Note that the three least significant bits are an ignored (ig) field.
Figure 3.9: BSPSTORE Register Format
The last register that works in conjunction with the RSE is application register Ar19. It is the 64-bit RSE NaT collection (RNAT) register. The format of this register is shown in Figure 3.10. When the RSE is spilling general registers to memory, it uses RNAT to temporarily hold the values of the corresponding NaT bits. Bit 63 always reads as 0 and ignores all writes.
Figure 3.10: RNAT Register Format
Compare and Exchange Value Register
Table 3.2 shows that application register 32 is the compare and exchange value (CCV) register. The compare and exchange (cmpxchg) instruction of the instruction set requires three source operands. This 64-bit register must contain the compare value, which is used as the third source operand of the instruction.
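To see why a separate compare value is needed, consider the following Python sketch of a compare-and-exchange operation. It is a simplified model, not the instruction definition: memory is represented by a dictionary, the function name is hypothetical, and the assumed behavior is that the value in memory is replaced only when it matches the compare value, with the original memory value returned in either case.

```python
def compare_and_exchange(memory, address, new_value, ccv):
    """Model a compare and exchange on one memory location.

    memory    -- dict standing in for memory (hypothetical model)
    address   -- first source operand: location to test
    new_value -- second source operand: value to store on a match
    ccv       -- third source operand: compare value from the CCV register
    """
    old_value = memory[address]
    if old_value == ccv:
        memory[address] = new_value
    return old_value


memory = {0x100: 5}
previous = compare_and_exchange(memory, 0x100, new_value=9, ccv=5)
print(previous, memory[0x100])
```

A caller can tell whether the exchange took place by comparing the returned value with the compare value it supplied.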
User NaT Collection Register
Application register 36, which is called the user NaT collection (UNAT) register, is similar to the RSE NaT collection (RNAT) register we examined earlier in that it is used for temporary storage of NaT bits. However, it temporarily holds them when saving and restoring the general registers with the ld8.fill and st8.spill instructions. The organization of this 64-bit register is the same as that shown for the RNAT register in Figure 3.10.
Interval Time Counter Register
An interval time counter (ITC) register is implemented in application register 44. The 64-bit value in this register automatically counts up at a fixed relationship to the processor’s clock frequency. Applications can directly sample the value in the ITC for making time-based calculations and performance measurements. System software can secure the ITC from nonprivileged access. When secured, a read of the ITC at any privilege level other than the most privileged causes a privileged register fault. The ITC can be written into only at the most privileged level.
Loop Program Structure Support Registers
We will now continue in the application register file to consider two registers that are provided for use in implementing software-pipelined loop program structures. They are the loop count (LC) register, Ar65, and the epilog count (EC) register, Ar66. The Itanium architecture supports two types of pipelined loop operations, the counted loop and the while loop. Counted loop operations involve both the LC and EC registers, but while loops only employ EC.
When implementing a counted loop operation, the LC register can be initialized with a value called the loop count. This 64-bit value must equal one less than the number of times the loop is to be repeated. The value in LC decrements each time the loop is completed. The value in the EC register equals the number of stages into which the body of the loop routine has been partitioned. The format of the epilog count register is shown in Figure 3.11. Here we find that the epilog count is just 6 bits. The upper 58 bits are identified as an ignored field. Since writes to an ignored field are ignored by the hardware, any value can be written to it; however, when an ig field is read, a value of 0 is always returned. Chapter 7 demonstrates the use of the loop count and epilog count in pipelined loop applications.
Figure 3.11: Epilog Count Register Format
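The relationship between the loop count and the number of repetitions can be illustrated with a small Python model. This is a deliberately simplified sketch: it assumes the loop branch repeats the body while LC is nonzero and decrements LC on each repeat, and it ignores the EC register and the pipeline stages entirely. The function names are hypothetical.

```python
def run_counted_loop(repetitions):
    """Count body executions when LC is initialized to repetitions - 1."""
    if repetitions < 1:
        raise ValueError("the loop body must run at least once")
    lc = repetitions - 1        # loop count: one less than the repetitions
    executed = 0
    while True:
        executed += 1           # loop body runs here
        if lc == 0:             # loop branch falls through when LC is 0
            break
        lc -= 1                 # otherwise LC decrements and the loop repeats
    return executed


def read_ec(written_value):
    """Value read back from EC: only the 6-bit epilog count is kept;
    the upper 58 bits are an ignored field that reads as 0."""
    return written_value & 0x3F


print(run_counted_loop(5), read_ec(0xFFFFFFFFFFFFFFC3))
```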
Previous Function State Register
In our study of the stacked general registers we found that every function had a unique register stack frame. Moreover, when we examined the current frame marker register, we found that its contents specified the organization of the current register stack frame. If another function is initiated from the current function, the value in the CFM register must be changed to that of the new function. To retain the ability to return to the original function at completion of the interrupting function, the value in the CFM register must be saved. This is the purpose of the previous function state (PFS) register. The PFS register, which is located in Ar64, contains multiple fields. Figure 3.12 diagrams the PFS format and Table 3.4 describes its fields.

Figure 3.12: PFS Format

Table 3.4: PFS Field Descriptions

Field   Bit range       Description
pfm     37:0            Previous frame marker
pec     57:52           Previous epilog count
ppl     63:62           Previous privilege level
rv      51:38, 61:58    Reserved
Notice that the fields are called previous frame marker (pfm), previous epilog count (pec), and previous privilege level (ppl). When a br.call instruction is executed to initiate a function call, the values of these items are automatically copied from the CFM register, EC register, and current privilege level bits in the processor status register, respectively, into their corresponding fields in the PFS register.
During the return to the original function with a br.ret instruction, the values of pfm and pec are restored to the CFM register and EC register, respectively. The ppl also is copied into the PSR register, unless this action would increase the privilege level.
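The save and restore of function state through the PFS register can be modeled as simple bit packing, using the field positions in Table 3.4. The sketch below is only an illustration with hypothetical function names; in particular, it does not model the rule that a return may not increase the privilege level.

```python
PFM_MASK = (1 << 38) - 1    # pfm occupies bits 37:0


def save_to_pfs(cfm, ec, cpl):
    """Pack the frame marker, epilog count, and privilege level as on a call."""
    return (cfm & PFM_MASK) | ((ec & 0x3F) << 52) | ((cpl & 0x3) << 62)


def restore_from_pfs(pfs):
    """Unpack (pfm, pec, ppl) as used on a return."""
    pfm = pfs & PFM_MASK            # bits 37:0
    pec = (pfs >> 52) & 0x3F        # bits 57:52
    ppl = (pfs >> 62) & 0x3         # bits 63:62
    return pfm, pec, ppl


caller_cfm = 48 | (16 << 7)         # SOF = 48, SOL = 16
pfs = save_to_pfs(caller_cfm, ec=3, cpl=3)
print(hex(pfs), restore_from_pfs(pfs))
```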
USER MASK (UM) REGISTER
The user mask (UM) is another important register in the register set. It is a subset of the processor status register and is accessible to application programs. Figure 3.13 shows the format of the UM. Notice that it contains five active one-bit-wide fields. The values of these bits are set or reset through software and either configure how operations are performed in the Itanium application architecture or record information about execution of the instructions in the application program.
Table 3.5 summarizes the meaning of each bit in the user mask. For instance, the big-endian (be) bit (position 1) is used to select between big-endian and little-endian organization for information held in memory. Notice that resetting this bit location to logic 0 selects the little-endian mode of operation. The next bit, which is labeled up for user performance, enables or disables automatic monitoring of instruction set execution performance. The two most significant bits in the UM, monitor floating-point low (mfl) and monitor floating-point high (mfh), are used to monitor the usage of the floating-point registers. For example, if any one of the upper floating-point registers is written as the target of an instruction, mfh is set to 1. The active bits of the UM can be read or modified under software control.
Figure 3.13: User Mask Format

Table 3.5: User Mask Field Descriptions

Field   Bit position   Description
rv      0              Reserved
be      1              Big-endian memory access enable: controls loads and stores but not RSE memory accesses
                       0: accesses are done little-endian
                       1: accesses are done big-endian
                       This bit is ignored for IA-32 data memory access. IA-32 data references are always performed little-endian.
up      2              User performance monitor enable for IA-32 and Itanium instruction set execution
                       0: user performance monitors are disabled
                       1: user performance monitors are enabled
ac      3              Alignment check for IA-32 and Itanium data memory references:
                       0: unaligned data memory references could cause an unaligned data reference fault
                       1: all unaligned data memory references cause an unaligned data reference fault
mfl     4              Lower (f2..f31) floating-point registers written. This bit is set to 1 upon completion of an instruction that uses register f2..f31 as the target register. This bit is sticky and is cleared only by an explicit write of the user mask.
mfh     5              Upper (f32..f127) floating-point registers written. This bit is set to 1 upon completion of an instruction that uses register f32..f127 as the target register. This bit is sticky and is cleared only by an explicit write of the user mask.
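The bit positions in Table 3.5 can be exercised with a short Python sketch. The names below are hypothetical, and the model covers only the decoding of the user mask and the sticky behavior of the mfl and mfh bits.

```python
UM_BITS = {"be": 1, "up": 2, "ac": 3, "mfl": 4, "mfh": 5}


def decode_user_mask(um):
    """Return the five active user mask bits by name."""
    return {name: (um >> position) & 1 for name, position in UM_BITS.items()}


def note_fp_write(um, fr_index):
    """Return the user mask after an instruction writes floating-point
    register fr_index as its target; mfl and mfh are sticky once set."""
    if 2 <= fr_index <= 31:
        um |= 1 << UM_BITS["mfl"]
    elif 32 <= fr_index <= 127:
        um |= 1 << UM_BITS["mfh"]
    return um


um = note_fp_write(0b000010, 40)    # big-endian enabled, then f40 is written
print(decode_user_mask(um))
```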
PROCESSOR IDENTIFICATION REGISTERS
The next group of registers in the register set that we will consider consists of the processor identification (CPUID) registers. They are actually a file formed from five 64-bit-wide registers that contains information about the Itanium processor’s CPU. The current CPUID file contains the name of the supplier of the device and a unique 40-bit identifier for the processor. Let us now look at the information contained in these registers.
The layout of these registers is shown in Figure 3.14. Vendor information is located in CPUID registers 0 and 1, which are labeled CPUID[0] and CPUID[1], respectively. The name of the supplier of the CPU is coded in ASCII and permanently stored in bytes 0 through 15 of these registers. The earlier characters of the supplier’s name are stored in the lower-numbered byte locations. For instance, for Intel Corporation the I from Intel is in byte location 0 and the N from corporation is held in the byte 15 location.
Figure 3.14: CPUID Register File and Fields

Table 3.6: CPUID Register 3 Field Descriptions

Field      Bit range   Description
number     7:0         Number of CPUID registers: the index of the largest implemented CPUID register (one less than the number of implemented CPUID registers)
revision   15:8        Processor revision number: a unique 8-bit value that represents the revision or stepping of this processor implementation within the processor model
model      23:16       Processor model number: a unique 8-bit value that represents the processor model within the processor family
family     31:24       Processor family number: a unique 8-bit value that represents the processor family
archrev    39:32       Architecture revision: an 8-bit value that represents the architecture revision number that the processor implements
rv         63:40       Reserved
The 40 bits of version information in CPUID[3] identify the processor. Table 3.6 summarizes the version information held in this register. Notice that it includes the processor architecture revision (archrev), family (family), model (model), processor revision number (revision), and the number of implemented CPUID registers (number). All of this information is permanently coded into the CPUID register file. For this reason, the CPUID registers are read-only.
The processor version information is very important to an application program. By examining this information, the program can confirm that the processor in use is sufficient to run the application, enable certain features or capabilities based on the processor and architecture revision, and understand if software patches need to be installed to accommodate anomalies in the processor.
CPUID register 4 provides general application-level information about the features supported in the model of processor. The current revision of the architecture defines just one bit. Bit 0, which is the lb bit, stands for long branch instruction implemented.
What does the number field in CPUID[3] stand for? This value tells how many CPUID registers are implemented in the processor. Its value is set equal to one less than the number of CPUID registers. The first revision of the Itanium architecture has the value 4 coded into this field.
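As a study aid, the CPUID fields can be decoded with a few lines of Python. The function names and the sample register values are hypothetical; they are not read from a real processor. The vendor string helper also assumes that byte 0 is the least significant byte of CPUID[0].

```python
def decode_cpuid3(value):
    """Split CPUID[3] into the fields listed in Table 3.6."""
    return {
        "number": value & 0xFF,             # bits 7:0
        "revision": (value >> 8) & 0xFF,    # bits 15:8
        "model": (value >> 16) & 0xFF,      # bits 23:16
        "family": (value >> 24) & 0xFF,     # bits 31:24
        "archrev": (value >> 32) & 0xFF,    # bits 39:32
    }


def implemented_cpuid_registers(value):
    """The number field holds one less than the count of CPUID registers."""
    return decode_cpuid3(value)["number"] + 1


def vendor_string(cpuid0, cpuid1):
    """Join bytes 0..15 of CPUID[0] and CPUID[1] into an ASCII name
    (assumes byte 0 is the low-order byte of CPUID[0])."""
    raw = cpuid0.to_bytes(8, "little") + cpuid1.to_bytes(8, "little")
    return raw.rstrip(b"\x00").decode("ascii")


sample_cpuid3 = (7 << 24) | 4       # made-up value: family 7, number 4
print(decode_cpuid3(sample_cpuid3), implemented_cpuid_registers(sample_cpuid3))

# Made-up vendor name, packed the same way for illustration
cpuid0 = int.from_bytes(b"ExampleV", "little")
cpuid1 = int.from_bytes(b"endor\x00\x00\x00", "little")
print(vendor_string(cpuid0, cpuid1))
```

A number field of 4 corresponds to five implemented CPUID registers, which matches the five-register file described in this section.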
PERFORMANCE MONITOR DATA REGISTERS The last registers we will examine are known as performance monitor data (PMD) registers. Figure 3.15 shows these registers. They are data registers that are used by performance monitoring hardware that is built into the processor. In these registers, the processor saves information about instruction usage that is gathered while the application is running. The content of these registers can only be read by application programs. By sampling and analyzing this information, an application can be tuned for maximum performance.
Figure 3.15: Performance Registers

The PMD registers can be configured by the operating system to be accessible at all privilege levels. Moreover, the operating system is allowed to secure the user-configured performance monitors. If the content of a PMD register that is secured is read, the value returned is 0 regardless of the current privilege level. These registers can only be written into at the most privileged level.
Chapter 4 - Application Instruction Set Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Chapter 4: Application Instruction Set This chapter describes additional resources available to the application programmer with an introduction to the instruction set. Chapter 1 explained that Itanium architecture implements a revolutionary new 64-bit instruction set architecture. This chapter introduces you to the instruction set, assembly language instruction syntax, and the unique notational format used in preparing instructions for parallel execution by the Itanium processor.
INSTRUCTION SET ARCHITECTURE
The Itanium processor’s instruction set architecture (ISA) is designed with new features to enhance instruction level parallelism (ILP), increase software performance, and allow the compiler to communicate information to the processor. In earlier chapters, you saw that features such as predication, speculation, loop pipelining, and branch prediction increase ILP. Other capabilities that provide special support for software modularity, such as register stack switching, high performance floating-point instructions, and specific multimedia instructions, enable higher performance. Finally, communication between compiler and processor is facilitated by template information contained in the instruction bundles. The template information allows the processor to efficiently assign its functional units to the execution of instructions. All these features are elements of the ISA.
The instructions of the assembly language instruction set are another element of the ISA. For orderly presentation, related instructions in the instruction set are arranged in groups, as follows:
Memory access instructions
Register file transfer instructions
Integer computation instructions
Character strings and population count instructions
Compare instructions
Branch instructions
Multimedia instructions
The instruction set also includes floating-point instructions; however, they are covered in Chapter 9, which is dedicated to the floating-point programming model of the Itanium architecture.
ASSEMBLY LANGUAGE INSTRUCTION SYNTAX
Let us first look at the syntax of Itanium processor assembly language instructions. Assembly language statements follow this general format:

[qp] mnemonic[.comp] dest = srcs

where
qp         is the qualifying predicate register;
mnemonic   is a name that uniquely identifies an IA-64 instruction;
comp       is one of the instruction completers;
dest       is a destination operand; and
srcs       is a source operand(s).
Let us next look at the use of each of these fields in more detail. The qualifying predicate (qp) field identifies a predicate register to be used as a qualifying predicate. Remember that the application register set has 64 1-bit predicate registers, identified as Pr0 through Pr63. When an instruction is executed, the value in this predicate register is tested. If the value is True (1), the instruction executes and its result is committed in hardware. That is, the result is retained for use by future instructions. If the value is False (0), the result is not committed and the value is discarded. Not all instructions contain a qualifying predicate; however, one can accompany most instructions. Not all instructions in a program need to be qualified by a predicate. Remember that the state of predicate register Pr0 is hardwired to 1, so to make an instruction’s execution not be conditional on a predicate, use Pr0 in its qualifying predicate field. If no predicate is specified with an instruction, Pr0 is assumed to be the predicate and the execution of the instruction is not conditional. The instruction name (mnemonic) and completers (comp) fields together specify the specific instruction to be used. The mnemonic uniquely identifies the base instruction.
Some examples are add for arithmetic add, or for logical OR, and shr for shift right signed. Completers identify variations on the base instruction. They follow the mnemonic and are normally separated from it by periods. For example, the shift right signed (shr) instruction is changed to shift right unsigned by adding a .u completer. Therefore, the instruction is written as shr.u. Another commonly used completer is .z, which stands for zero. Not all instructions use a completer. For example, neither the integer add (add) instruction nor the logical OR (or) instruction has a variation that uses one. Other instructions permit the use of several completers. An example of an instruction with more than one completer is:

ld.c.nc

Here the base is the load (ld) instruction, and adding the completers check (.c) and no clear (.nc) makes it perform a check load with no clear.

The source operands (srcs) and destination operands (dest) fields identify the register or immediate operands processed by the instruction. Table 4.1 shows the notation used to identify each of the Itanium processor’s register types in assembly language statements. Notice that the general registers are identified with r. For instance, general register 0 (Gr0) is identified in assembly language statements as r0. Another example is CPUID register 3, which is expressed in an assembly language statement as cpuid3. Not all register types are permitted as operands in all instructions. For instance, integer arithmetic instructions, such as add, only allow use of the general registers.

Table 4.1: Register File Notation for Assembly Language Statements

Register file                          Assembly mnemonic
Application registers                  ar
Branch registers                       b
CPU identification registers           cpuid
Floating-point registers               f
General registers                      r
Performance monitor data registers     pmd
Predicate registers                    p
An example of an instruction is

(p1) add r1 = r2,r3

When executed, this instruction adds the contents of the source operands, general registers 2 and 3, and places the sum in the destination register, general register 1. This instruction is qualified by predicate register 1. For this reason, the result of the computation is not saved in r1 unless the state of p1 is True (logic 1) when the instruction is executed.

Now, look at how indirect addressing is identified in an assembly language statement. When data are loaded from or stored into memory, it is a common practice to use the value in a general register as the address pointer to the memory location. Enclosing the operand in brackets indicates that a general register holds an indirect address. An example is the instruction:

(p0) ld8.s r31 = [r3]

Execution of this instruction loads into register Gr31 the quad word value held in the memory location pointed to by the address in register Gr3. The .s completer marks the load as speculative.
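The effect of a qualifying predicate can be modeled in a few lines of Python. This is only an illustrative sketch, not processor code: the dictionaries standing in for the general and predicate register files and the function name execute_predicated_add are assumptions made for this example.

```python
MASK64 = (1 << 64) - 1  # general registers hold 64-bit values


def execute_predicated_add(regs, preds, qp, dest, src1, src2):
    """Model (qp) add dest = src1,src2 on dictionary-based register files."""
    if preds[qp]:  # qualifying predicate is True (1): commit the result
        regs[dest] = (regs[src1] + regs[src2]) & MASK64
    return regs    # predicate False (0): the result is discarded


regs = {"r1": 0, "r2": 5, "r3": 7}
preds = {"p0": 1, "p1": 0}  # p0 is hardwired to 1

execute_predicated_add(regs, preds, "p1", "r1", "r2", "r3")
skipped = regs["r1"]        # p1 is 0, so r1 keeps its old value

preds["p1"] = 1
execute_predicated_add(regs, preds, "p1", "r1", "r2", "r3")
committed = regs["r1"]      # p1 is 1, so the sum is written to r1
```

Passing "p0" as the qualifying predicate in this model always commits the result, which mirrors the unconditional case described earlier.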
INSTRUCTION GROUPS

As long as instructions in a program are independent, they can be executed in parallel. Itanium architecture requires the compiler, or assembly code programmer, to analyze the sequential instruction stream of the program to identify those instructions that are dependent and to rearrange the sequence by grouping together those that are independent. A set of contiguous instructions that are independent is called an instruction group. By rearranging the instructions into these groups, the parallel execution resources of the Itanium processor can be used more efficiently.
Types of Dependencies

The compiler must identify two types of dependencies: control dependency and data dependency. An example of a control dependency is the occurrence of a conditional branch instruction followed by a load instruction. Since the execution of the load instruction depends on whether or not the branch is taken, the compiler cannot schedule the load instruction earlier by moving it up in the instruction stream. Therefore, the load instruction is considered control dependent on the branch instruction and they must be located in separate instruction groups.

A data dependency exists between an instruction that accesses a register or memory location and another instruction that alters the same register or storage location. Table 4.2 summarizes the types of data dependencies that can exist between instructions. The first example, which is called a write-after-write (WAW) dependency, represents two instructions that write to the same register or memory location. If the sequence of these instructions were changed, the values contained in those locations could be incorrect when instructions that use their results are executed. For this reason, they must be in separate instruction groups.

The second data dependency listed in Table 4.2 is similar to the WAW dependency, but this time the second data operation is a read. The following instruction sequence is an example of a read-after-write (RAW) dependency:

add r4 = r5,r6
sub r7 = r4,r9

Notice that Gr4 is written into by the first instruction, then its contents are read by the second instruction. If the compiler scheduled these instructions in the reverse order, the result produced would be incorrect. Therefore, an instruction group boundary must separate them.

Table 4.2: Types of Data Dependencies

Dependency type            Description
Write-after-write (WAW)    A dependence between two instructions that write to the same register or memory location
Read-after-write (RAW)     A dependence between two instructions in which an instruction writes to a register or memory location that is read by a subsequent instruction
Write-after-read (WAR)     A dependence between two instructions in which an instruction reads a register or memory location that a subsequent instruction writes
Ambiguous memory           Dependency between a load and a store, or between two stores, where it cannot be determined if the involved instructions access overlapping memory locations
The third dependency shown in Table 4.2 is the write-after-read dependency. Register WAR dependencies are permitted within an instruction group.

The last dependency described in Table 4.2 is the ambiguous memory dependency. This dependency is between a load and a store instruction (RAW or WAR) or two store instructions (WAW). A dependency exists unless it can be assured that the store and load operations or two consecutive store operations do not access the same memory location. For example, in the following instructions,

add r13 = r25,r27
st4 [r29] = r13 ;;
ld4 r2 = [r3]

the load (ld4) instruction cannot be moved up to be grouped with the add instruction because of an ambiguous memory dependency. That is, it cannot be assured that the addresses in Gr29 and Gr3 do not result in access of the same storage location in memory. The data speculation process can overcome many ambiguous memory dependencies.
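The register dependency types in Table 4.2 can be checked mechanically by comparing the registers each instruction writes and reads. The following Python sketch illustrates that idea under simplifying assumptions: each instruction is described by hand as a pair of register-name sets, memory dependencies are ignored, and the name classify_dependencies is invented for this example.

```python
def classify_dependencies(first, second):
    """Return the register dependency types from `first` to `second`.

    Each instruction is a (writes, reads) pair of register-name sets,
    and `first` precedes `second` in program order.
    """
    writes1, reads1 = first
    writes2, reads2 = second
    found = []
    if writes1 & writes2:
        found.append("WAW")  # both write the same register
    if writes1 & reads2:
        found.append("RAW")  # second reads what first wrote
    if reads1 & writes2:
        found.append("WAR")  # second overwrites what first read
    return found


add = ({"r4"}, {"r5", "r6"})  # add r4 = r5,r6
sub = ({"r7"}, {"r4", "r9"})  # sub r7 = r4,r9

in_order = classify_dependencies(add, sub)  # ['RAW']
reversed_order = classify_dependencies(sub, add)  # ['WAR']
```

A scheduler built on this check would place a stop between any pair that reports WAW or RAW, while a register WAR result alone would not force a new instruction group.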
Identifying Instruction Groups

The segment of program in Figure 4.1 demonstrates instruction groups. Here, the compiler has arranged the sequential instruction stream into four groups of independent instructions, called instruction groups A through D. The syntax requires you to mark the boundary between instruction groups with a double semicolon (;;) as the delimiter. An instruction group can contain as few as one instruction, and has no upper bound. For instance, in the example program in Figure 4.1, instruction group A has a single instruction and group C contains five instructions. The Itanium processor can simultaneously execute as many instructions from a single instruction group as its execution unit resources allow.

Figure 4.1: Identifying Instruction Groups

Consider the dependency that exists between the instructions in groups A and B in Figure 4.1. Comparing the instructions in these two groups, you can see that the add instruction in group A and the move instruction in group B both access a resource identified as r31. If these instructions were executed out of order and the move from group B was performed before the add from group A, the value of r31 might not be correct. This example is a RAW dependency. To produce correct results, a mechanism must be in place to ensure that the add instruction from group A is executed before the move instruction from group B.
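To make the role of the ;; delimiter concrete, the short Python sketch below splits a list of assembly statements into instruction groups at each stop. The four statements are hypothetical and are not the contents of Figure 4.1; the function name split_into_groups is also an assumption of this example.

```python
def split_into_groups(lines):
    """Split assembly statements into instruction groups at each ';;' stop."""
    groups, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        ends_group = text.endswith(";;")
        if ends_group:
            text = text[:-2].strip()  # drop the stop marker itself
        current.append(text)
        if ends_group:
            groups.append(current)
            current = []
    if current:  # statements after the last stop form a final group
        groups.append(current)
    return groups


program = [
    "add r31 = r5,r6 ;;",  # hypothetical group A
    "mov r4 = r31",        # hypothetical group B
    "ld8 r7 = [r8] ;;",
    "sub r9 = r4,r7",      # hypothetical group C
]
groups = split_into_groups(program)
```

In this made-up program the mov reads r31 after the add writes it, so the stop after the add separates the two into different groups.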
INSTRUCTION BUNDLE FORMAT AND TEMPLATES

In Chapter 2, the explanation of the Itanium processor’s memory address space showed how data and code were stored in memory. The compiler packs instruction code in bundles for storage in memory. A bundle is 16 bytes (128 bits) long, and bundles of instruction code are stored in memory on aligned bundle address boundaries. The code always uses little-endian organization. Within memory, the bytes of a bundle are ordered from lowest to highest address. In this section, you will look inside the bundle to see what information it holds and how this information is organized.

The format of an instruction bundle is shown in Figure 4.2(a). Notice that it contains a 5-bit template field and three 41-bit instruction slot fields. The template field is located in the five least-significant bit locations and is followed by the fields for instruction slots 0, 1, and 2, respectively. Within a bundle, the instructions in lower-numbered instruction slots precede those in higher-numbered slots. For this reason, the instruction in slot 0 precedes the instruction in slot 1 in the program. Moreover, instructions of a program in a bundle with a lower memory address normally precede those instructions in a bundle with a higher address. Figure 4.2(b) shows how a bundle is stored in memory starting at address 10000H.
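The bundle layout just described (a 5-bit template in the least-significant bits, followed by three 41-bit slots, stored little-endian in 16 bytes) can be expressed as a pair of Python helpers. This is a sketch for illustration only; the function names and the slot values used below are placeholders, not real instruction encodings.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_BITS = 5


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into 16 bytes."""
    value = template & 0x1F
    for index, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * index)
    return value.to_bytes(16, "little")


def unpack_bundle(bundle_bytes):
    """Split a 16-byte bundle into (template, [slot0, slot1, slot2])."""
    if len(bundle_bytes) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(bundle_bytes, "little")
    template = value & 0x1F
    slots = [
        (value >> (TEMPLATE_BITS + SLOT_BITS * index)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots


raw = pack_bundle(0x10, [0x1, 0x2, 0x3])  # placeholder slot contents
template, slots = unpack_bundle(raw)
```

Because 5 + 3 × 41 = 128, the template and the three slots account for every bit of the bundle.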
Figure 4.2: Instruction Code Bundle Format and Example

What is the purpose and content of the template field? An important element of the architecture is the ability of the compiler to communicate information about instruction parallelism to the Itanium processor. The template field conveys that information. Remember that the Itanium processor has multiple execution units, and they can be used to execute different instructions at the same time, except when the execution of one instruction depends on execution of another instruction earlier in the program. Inserting an instruction group delimiter identifies the location of these dependencies.

The template field contains information about the three instructions in the bundle that enables the processor to decide which instructions can be executed in parallel. First, it tells the Itanium processor the type of execution unit required for the execution of each instruction. Instructions are categorized into one of six types based on their use of execution resources. Table 4.3 lists these instruction types and the corresponding execution unit type. For example, instruction type A stands for integer ALU instruction. During execution, it could require either the I-unit (integer unit) or M-unit (memory unit). On the other hand, type F instructions (floating-point instructions) require just the F-unit (floating-point unit).

Table 4.3: Relationship between Instruction Type and Execution Unit Type

Instruction type    Meaning            Execution unit type
A                   Integer ALU        I-unit or M-unit
I                   Non-ALU integer    I-unit
M                   Memory             M-unit
F                   Floating-point     F-unit
B                   Branch             B-unit
L+X                 Extended           I-unit/B-unit
The compiler also uses the template to alert the processor to an instruction group boundary within a bundle, using a mechanism known as a stop. Inserting a template with a stop means that the execution of instructions located after the stop may depend on resources provided by instructions before the stop. When the Itanium processor identifies a template that contains a stop, the processor resolves the resource dependency before processing the instruction or instructions that follow the stop. A stop can be specified at the end of any bundle, or within a bundle by using one of the special template types that implicitly include intra-bundle stops.

The ISA defines standard templates for the compiler to use to mark bundles of code for execution by the Itanium processor. They are listed in Table 4.4. Notice that if the 00 template is used in a bundle, the instruction in slot 0 requires the memory unit and those in slots 1 and 2 need integer units. For this reason, this template is called type MII, which stands for memory-integer-integer.

In Table 4.4, two vertical parallel lines are used in a template to identify the location of a stop corresponding to an instruction group boundary. For instance, template 02 has a stop after instruction slot 1. Using this template, the compiler tells the processor that the instruction in slot 2 might depend on data being manipulated by instructions in the previous instruction group, which contains the instructions in slot 0 and slot 1. This template is denoted as type MI_I for memory-integer-stop-integer. Template 0A, type M_MI, is similar except that execution of the instructions in the group that contains the instructions in slots 1 and 2 depends upon results produced by the instruction group that contains the instruction in slot 0.

Table 4.4: Template Field Encoding and Instruction Slot Mapping

Template    Type        Slot 0       Slot 1       Slot 2
00          MII         M-unit       I-unit       I-unit
01          MII         M-unit       I-unit       I-unit ||
02          MI_I        M-unit       I-unit ||    I-unit
03          MI_I        M-unit       I-unit ||    I-unit ||
04          MLX         M-unit       L-unit       X-unit
05          MLX         M-unit       L-unit       X-unit ||
06          Reserved
07          Reserved
08          MMI         M-unit       M-unit       I-unit
09          MMI         M-unit       M-unit       I-unit ||
0A          M_MI        M-unit ||    M-unit       I-unit
0B          M_MI        M-unit ||    M-unit       I-unit ||
0C          MFI         M-unit       F-unit       I-unit
0D          MFI         M-unit       F-unit       I-unit ||
0E          MMF         M-unit       M-unit       F-unit
0F          MMF         M-unit       M-unit       F-unit ||
10          MIB         M-unit       I-unit       B-unit
11          MIB         M-unit       I-unit       B-unit ||
12          MBB         M-unit       B-unit       B-unit
13          MBB         M-unit       B-unit       B-unit ||
14          Reserved
15          Reserved
16          BBB         B-unit       B-unit       B-unit
17          BBB         B-unit       B-unit       B-unit ||
18          MMB         M-unit       M-unit       B-unit
19          MMB         M-unit       M-unit       B-unit ||
1A          Reserved
1B          Reserved
1C          MFB         M-unit       F-unit       B-unit
1D          MFB         M-unit       F-unit       B-unit ||
1E          Reserved
1F          Reserved
Vertical parallel lines identify the location of a stop corresponding to an instruction group boundary.

The instruction set architecture provides twelve different templates, and each has two versions: one without a stop at the end and the other with a stop. If the instruction sequence does not fit one of these templates, the compiler will select the best fit, then pack one or two instruction slots with no operation (nop) instructions. The no operation instructions are marked to match the type of functional unit that is expected for the slot. For example, the memory nop and integer nop are denoted as nop.m and nop.i, respectively. If an instruction sequence fits the MII template, but the program sequence contains only one integer instruction, not two, slot 2 is filled with a nop.i instruction.

How does one template differ from another? Comparing the format of template 02 with template 03 in Table 4.4, we see that 03 has two stops instead of one. One stop is located after instruction slot 1 and another at the end. Together, they represent the two versions of the type MI_I template. In this way, an application program consists of a sequence of instructions and stops packed in bundles.

There is no correlation between instruction groups and the beginning or end of bundles. A large instruction group can span several bundles and start or end anywhere within a bundle. Figure 4.3 illustrates this idea. Notice that both bundles 2 and 3 hold instructions from instruction group C.
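The template encodings of Table 4.4 and the nop-padding rule can be summarized in a small Python sketch. It is a simplified illustration under several assumptions: the helper names are invented, each instruction is tagged by hand with the unit letter of the slot it should occupy, and the two-slot L+X case of the MLX template is not handled.

```python
# Even template codes from Table 4.4; the odd code that follows each one is
# the same type with an additional stop at the end of the bundle.
TEMPLATE_TYPES = {
    0x00: "MII", 0x02: "MI_I", 0x04: "MLX", 0x08: "MMI", 0x0A: "M_MI",
    0x0C: "MFI", 0x0E: "MMF", 0x10: "MIB", 0x12: "MBB", 0x16: "BBB",
    0x18: "MMB", 0x1C: "MFB",
}


def describe_template(code):
    """Return (type, stop_at_end) for a template code, or None if reserved."""
    base = code & ~1
    if base not in TEMPLATE_TYPES:
        return None
    return TEMPLATE_TYPES[base], bool(code & 1)


def fill_bundle(template_type, instructions):
    """Place (unit, text) instructions in order, padding gaps with nops."""
    units = template_type.replace("_", "").lower()
    pending = list(instructions)
    slots = []
    for unit in units:
        if pending and pending[0][0] == unit:
            slots.append(pending.pop(0)[1])
        else:
            slots.append("nop." + unit)  # nop typed to match the slot
    if pending:
        raise ValueError("instructions do not fit template " + template_type)
    return slots


mii = fill_bundle("MII", [("m", "ld4 r28 = [r8]"), ("i", "add r9 = 2,r1")])
mfi = fill_bundle("MFI", [("i", "mov r4 = r5")])  # hypothetical move
```

Here the MII bundle receives a nop.i in slot 2, and the single move placed in an MFI bundle is preceded by nop.m and nop.f.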
Figure 4.3: Bundle and Instruction Group Boundaries

The code sequence and its corresponding bundles of instructions in Figure 4.4 demonstrate both the partitioning of an instruction group between bundles and the use of nop instructions to complete bundle templates. Bundle 1 contains only the move instruction from the first instruction group and is filled with a floating-point nop and memory nop to form an MFI template. The second bundle contains instructions from both instruction groups.
Figure 4.4: Instruction Groups and Corresponding Bundles
SYLLABLE OF INSTRUCTIONS

The compiler or assembler organizes instruction groups into bundles for execution by the Itanium processor. The assembler allows bundling to be performed either automatically by the assembler or by hand by the programmer. Figure 4.5 shows how a bundle is typically expressed. Notice that each bundle of instructions is enclosed in curly braces and contains a template specification and three instructions. The example contains three syllables: one in the memory (M) slot, which accepts M-type or A-type instructions, followed by two in integer (I) slots.

{ .mii
  ld4 r28 = [r8]      // Load a 4-byte value
  add r9 = 2,r1       // 2+r1 and put in r9
  add r30 = 1,r1      // 1+r1 and put in r30
}

Figure 4.5: Syllable of Code
Chapter 5: Memory Access and Register Transfer Instructions

The Itanium architecture requires all memory-to-register information transfers to take place between a storage location in memory and a register in the general register or floating-point register files. For this reason, a general register or floating-point register must act as an intermediary for transfers of data between memory and any other internal registers, such as the predicate registers, branch registers, or application registers. This chapter explains the instructions that facilitate transfer of information between memory and registers. These instructions are grouped as the memory access instructions and register transfer instructions.
MEMORY ACCESS INSTRUCTIONS

Memory access instructions can only process information in the general and floating-point register files. The instructions that are used to load data from memory, save data in memory, or modify information directly in memory are called the memory access instructions. This group includes the load, store, and semaphore instructions. The first part of this chapter examines the operation of the instructions in each of these subgroups with respect to the general registers. Their use with the floating-point registers will be covered in a later chapter.
LOAD INSTRUCTION

The load instruction transfers data from a source operand in memory to a destination operand that is a general register. The base mnemonic for the instruction, which is ldsz, and its general formats are shown in Table 5.1.

Table 5.1: Load Instruction Formats

Mnemonic    Operation    Format
ldsz        Load         (qp) ldsz.ldtype.ldhint r1=[r3]
                         (qp) ldsz.ldtype.ldhint r1=[r3],r2
                         (qp) ldsz.ldtype.ldhint r1=[r3],imm9
                         (qp) ld8.fill.ldhint r1=[r3]
                         (qp) ld8.fill.ldhint r1=[r3],r2
                         (qp) ld8.fill.ldhint r1=[r3],imm9
The standard form of the load instruction is given by this format:

(qp) ldsz.ldtype.ldhint r1=[r3]

When executed, this instruction loads the element of data in the memory location indicated by the address in the general register identified as r3 into the general register corresponding to r1. The contents of a general register specify the address of the storage location in memory that is to be accessed. This is an example of indirect addressing.

The size (sz) part of the load mnemonic tells the size of the data element. Table 5.2 lists the permitted values for sz and the corresponding number of bytes involved in the data transfer. Notice that the values 1, 2, 4, and 8 correspond to the byte, word, double word, and quad word integer data types, respectively. For instance, if the mnemonic is written as ld1, a single byte of data is read from memory into the least significant byte of the general register selected for r1. On the other hand, specifying ld8 results in 8 bytes (64 bits) of data being loaded into the specified register. For sizes less than 8 bytes, the value loaded into the general register is zero-extended to 64 bits.

Table 5.2: sz Completers

Size completer    Bytes accessed    Data type
1                 1 byte            Byte
2                 2 bytes           Word
4                 4 bytes           Double word
8                 8 bytes           Quad word
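The zero-extension rule for the sz completer can be illustrated with a short Python model. It is a sketch under stated assumptions: memory is a plain bytes object, the data are read in little-endian byte order, and the function name load is invented here.

```python
def load(memory, address, sz):
    """Model ldsz: read sz bytes and zero-extend the value to 64 bits.

    Assumes little-endian data; Python integers are unbounded, so the
    upper bits of the result are simply zero.
    """
    if sz not in (1, 2, 4, 8):
        raise ValueError("sz must be 1, 2, 4, or 8")
    return int.from_bytes(memory[address:address + sz], "little")


memory = bytes([0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99, 0x88])

ld1 = load(memory, 0, 1)  # 0x00000000000000FF
ld2 = load(memory, 0, 2)  # 0x000000000000EEFF
ld4 = load(memory, 0, 4)  # 0x00000000CCDDEEFF
ld8 = load(memory, 0, 8)  # 0x8899AABBCCDDEEFF
```

Even though the byte at address 0 has its top bit set, ld1 yields a small positive value because the upper 56 bits are filled with zeros rather than copies of the sign bit.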
The instruction mnemonic supports several completers that further define the data transfer that will take place. The first completer is labeled ldtype and stands for load type. You would use this completer to specify special instruction variants that support operations such as control speculation and data speculation. Table 5.3 lists the values allowed for ldtype and identifies the type of load operation they represent. For instance, the .s completer is used to write a speculative load instruction, and the .a completer makes an advanced load instruction. So, using both s and a together (.sa) specifies a speculative advanced load instruction. The supported memory load instruction types are summarized in Table 5.4.

Table 5.3: Load Type Completers

ldtype       Meaning                      Special load operation
None         Normal load
s            Speculative load             Certain exceptions may be deferred rather than generating a fault. Deferral causes the target register’s NaT bit to be set. The NaT bit is later used to detect deferral.
a            Advanced load                An entry is added to the advanced load address table (ALAT). This allows later instructions to check for colliding stores. If the referenced data page has a nonspeculative attribute, the target register and NaT bit are cleared, and the processor ensures that no ALAT entry exists for the target register. The absence of an ALAT entry is later used to detect deferral or collision.
sa           Speculative advanced load    An entry is added to the ALAT, and certain exceptions may be deferred. Deferral causes the target register’s NaT bit to be set, and the processor ensures that no ALAT entry exists for the target register. The absence of an ALAT entry is later used to detect deferral or collision.
c.nc         Check load (no clear)        The ALAT is searched for a matching entry. If found, no load is done and the target register is unchanged. Regardless of an ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a load is performed, and an entry is added to the ALAT (unless the referenced data page has a nonspeculative attribute, in which case no ALAT entry is allocated).
c.clr        Check load (clear)           The ALAT is searched for a matching entry. If found, the entry is removed, no load is done and the target register is unchanged. Regardless of an ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a clear check load behaves like a normal load.
c.clr.acq    Ordered check load (clear)   This type behaves the same as the unordered clear form, except that the ALAT lookup (and resulting load, if no ALAT entry is found) is performed with acquire semantics.
acq          Ordered load                 An ordered load is performed with acquire semantics.
bias         Biased load                  A hint is provided to the implementation to acquire exclusive ownership of the accessed cache line.

Table 5.4: Supported Memory Load Instructions

Mnemonic             Operation
ld                   Normal load
ld.s                 Speculative load
ld.a                 Advanced load
ld.sa                Speculative advanced load
ld.c.nc, ld.c.clr    Check load
ld.c.clr.acq         Ordered check load
ld.acq               Ordered load
ld.bias              Biased load
ld8.fill             Fill load
The second completer element used with the base load instruction is called a load hint and it is identified in the general format as ldhint. Load hints permit control of the memory/cache subsystem. The value selected for ldhint specifies the locality of the memory access and those levels of memory hierarchy that are affected by the memory access. In this way, it implies an allocation path through the hierarchy of the memory subsystem. Table 5.1 shows that all formats use the ldhint completer. The choices for ldhint are given in Table 5.5. For example, the .nta completer stands for no temporal locality, all levels. The load instruction uses this form when the data being loaded are not expected to be reused soon at any level of the cache hierarchy.

Table 5.5: Load and Store Hint Completers

Hint    Meaning                             Instructions applied to
None    Temporal locality, level 1          ldsz, stsz, xchgsz, cmpxchgsz, fetchadd
nt1     No temporal locality, level 1       ldsz, xchgsz, cmpxchgsz, fetchadd
nta     No temporal locality, all levels    ldsz, stsz, xchgsz, cmpxchgsz, fetchadd
What type of load instruction is written as ld2.nt1 r10=[r15]? What operation does it perform? This instruction represents a normal load of word data with no temporal locality, level 1. When executed, the word value in the memory location pointed to by the address held in Gr15 is read from memory, zero-extended to 64 bits, and loaded into Gr10.

The second format for the load instruction is:

(qp) ldsz.ldtype.ldhint r1=[r3],r2

This form differs in the way the address of the operand in memory is specified. In this format, two address elements are specified. The first element is the base address that is specified by the value in general register r3. During execution of the instruction, the value in r3 is used as the address of the source operand. After the memory location has been accessed, the value in r2 is added to the base address in r3 and the updated value is placed in r3. Since the base address update is done after the load operation is complete, the update operation is a post-increment. In this way, the base address automatically points to the next element of data in memory. This addressing technique is known as base-address-register update addressing.

The instruction format that follows performs the same operation, except that the value that is added to post-increment the base address is specified by immediate operand imm9.

(qp) ldsz.ldtype.ldhint r1=[r3],imm9

In this third format, an immediate value is added to the base address, and the result is placed back in the base address register. In the second format, the contents of a general register are the update information.

Example
Suppose that the values in Gr5 and Gr10 are F000000100000000H and 0000000000000008H, respectively, when the instruction

(p0) ld8 r20=[r5],r10

is executed. What type of load operation is performed, what is the size of the data element to be loaded from memory, and what is the address of the next storage location to be accessed?

Solution
Since the load instruction mnemonic is appended with the value 8 and no completers are added, the instruction represents a normal load of a quad word to Gr20. After the memory access takes place, the base address in Gr5 is post-incremented by the value in Gr10 to give:

Gr5 = F000000100000000H + 0000000000000008H = F000000100000008H
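The post-increment calculation in the example can be reproduced with a few lines of Python. This is a sketch only; the function name post_increment is an assumption, and the 64-bit wraparound mask reflects the width of a general register.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def post_increment(base, increment):
    """Return (access_address, updated_base) for a load with base update.

    The memory access uses the original base; the update happens afterward.
    """
    return base, (base + increment) & MASK64


# (p0) ld8 r20=[r5],r10 with Gr5 = F000000100000000H and Gr10 = 8H
address, new_gr5 = post_increment(0xF000000100000000, 0x8)

# A negative immediate steps the base address backward instead.
_, stepped_back = post_increment(0x1000, -8)
```

The access itself still uses F000000100000000H; only the value left in Gr5 for the next access changes.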
Load Types
In addition to the normal or nonspeculative load, the load instruction can perform speculative loads, advanced loads, speculative advanced loads, check loads, ordered loads, and biased loads. The compiler will specify the appropriate type of load instructions based on the operations they must perform. Table 5.3 provides a summary of each of these load operations. Remember that control speculation enables a load instruction to be executed before its results are needed, and if it turns out that the instruction does not actually need to be executed, these results are simply discarded. The key difference between the nonspeculative-load (ld) instruction and speculative-load (ld.s) instruction is the way exceptions are serviced. If an exception occurs when a normal load instruction is executed, a fault is immediately generated to request service for the exception. For this reason, it is unsafe to schedule a normal load instruction before it is known that it will be executed. A normal load instruction can be rewritten to perform a speculative load, as follows: (p0) ld8.s r20=[r5],r10 This instruction can be used in situations where an exception may occur, but it is not known that the instruction causing the exception will be executed. The ld.s instruction defers exceptions. That is, if an exception occurs during the load operation, a fault is not raised to request service. This deferral makes it safe to schedule a speculated load instruction before knowing whether or not it will actually be executed. For this reason, the speculative load instruction is used to implement instruction sequences that employ control speculation. The NaT bit associated with a general register is the mechanism used to mark the existence of a deferred exception for the value in a general register. 
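The difference between a normal load and a speculative load can be sketched as a small Python model. Everything here is an illustrative assumption rather than a description of the hardware: memory is a dictionary, an address missing from it stands for an access that would cause an exception, and LoadFault and do_load are invented names.

```python
class LoadFault(Exception):
    """Stands in for the fault raised by a normal (nonspeculative) load."""


def do_load(regs, nat, memory, dest, address, speculative):
    """Model ld8 (speculative=False) or ld8.s (speculative=True)."""
    if address in memory:
        regs[dest] = memory[address]
        nat[dest] = 0        # successful load clears the NaT bit
    elif speculative:
        nat[dest] = 1        # ld.s defers the exception by setting NaT
    else:
        raise LoadFault(hex(address))  # ld raises the fault immediately


memory = {0x1000: 42}        # only address 0x1000 is accessible
regs, nat = {"r20": 0}, {"r20": 0}

do_load(regs, nat, memory, "r20", 0x2000, speculative=True)
deferred = nat["r20"]        # 1: exception deferred, no fault raised

try:
    do_load(regs, nat, memory, "r20", 0x2000, speculative=False)
    faulted = False
except LoadFault:
    faulted = True           # the normal load faults right away

do_load(regs, nat, memory, "r20", 0x1000, speculative=True)
```

After the final call the load has succeeded, so r20 holds 42 and its NaT bit is back to 0.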
If an exception occurs when a speculative load instruction is executed, the NaT bit for the destination general register is set to 1, indicating that the speculative load has failed and that the value of data in the corresponding general register is invalid.

How does the NaT bit impact the execution of a non-speculative load instruction? An earlier example of a normal load instruction was:

(p0) ld8 r20=[r5],r10

When this instruction is executed successfully, the specified load operation is performed, then the NaT bit associated with its destination operand (Gr20) is cleared. This cleared bit signals that an exception has not been deferred. However, if the NaT bit associated with the base address register (Gr5) is set when the instruction is executed, the load operation is not performed; instead, a register NaT consumption fault occurs. This exception is immediately raised for service and the NaT bit for the destination operand (Gr20) is again cleared. On the other hand, if the NaT bit corresponding to the address increment register (Gr10) is set to 1 when the load instruction is executed, the NaT bit corresponding to the base address register (Gr5) gets set to 1 and no fault is raised. This result signals that the exception has been deferred. The same rule applies to an ld.s instruction: if the NaT bit associated with the general register that holds the address increment used in the base register update calculation is 1, the NaT bit corresponding to the base address register is set to 1 and no fault is raised.

Ambiguous memory dependence occurs between a store and a load when the compiler cannot determine whether or not the load and store operations access the same memory location. The advanced load instruction resolves the ambiguous reference that could exist in an instruction sequence that consists of a store instruction followed by a load instruction. The advanced load instruction implements a process called data speculation that enables the compiler to move the load instruction ahead of the store instruction in the instruction sequence. Consider this example of an advanced load instruction:

(p0) ld8.a r20=[r5],r10

When this advanced load operation is performed, the processor computes an entry, which is called a tag, and places that tag in a structure called the advanced load address table (ALAT). This ALAT entry uniquely identifies the destination register Gr20. The presence of this entry in the ALAT indicates that the data speculation succeeded and that the value in Gr20 is valid.

In the data speculation process, a check load instruction, written with either the ld.c.nc or the ld.c.clr completers, confirms whether or not the advanced load was successful by checking for its tag in the ALAT. If the data speculation failed (no tag is found), the check load instruction also reinitiates the load operation to recover. Check load is used when only the advanced load instruction itself is scheduled before the store that is ambiguous relative to the advanced load.
For instance, a check load instruction that can be used to complete the data speculation instruction sequence initiated by our earlier example of the advanced load instruction is (p0) ld8.c.clr r20=[r5],r10 The check load instruction must have the exact same operand combination as its associated advanced load instruction. A completer is used to implement clear (.clr) and no clear (.nc) versions of the check load instruction. After reading a tag in the ALAT table, the ld.c.clr instruction causes the tag to be cleared. The compiler uses this form when it knows that the ALAT entry will
not be used again. The ld.c.nc version differs in that it leaves the entry in the table. Using these completers, the software indicates whether or not entries are needed, making it less likely that a useful entry will be unnecessarily forced out of ALAT because all of its entry locations are already in use. For a speculative advanced load (.sa) instruction, the operation is similar to the advanced load in that when executed, the instruction enters a tag into the ALAT register. However, it also defers exceptions. Whenever a deferred exception occurs during a speculative advanced load operation, the NaT bit of the general register for the destination operand is marked and the entry of the tag into the ALAT is suppressed. Therefore, the speculative advanced load instruction can be used to enable the early scheduling of a load operation that is both control and data speculative. Example What type of operation is performed by the following instruction? (p0) ld8.sa r20=[r5],08H Solution This instruction initiates a combined control/data speculation instruction sequence by performing an early load of 64 bits of data into destination register Gr20 from the memory location pointed to by the address in register Gr5. The address of the source operand is post-incremented by 8H after the memory read is performed.
Itanium architecture memory accesses follow one of four memory ordering semantics: unordered, release, acquire, and fence. Memory ordering semantics define the method by which the effects of a particular memory instruction become visible relative to other memory instructions. Visibility in the context of memory ordering refers to the tangible effects that a memory instruction will have in the processor. For example, a load instruction may become visible when the caches receive a request to look up the data being requested by the instruction. The acquire (.acq) completer on the load instruction ld.acq specifies what is called an ordered load, so that the memory access operation uses acquire semantics. Enforcing acquire semantics on a memory access guarantees that the accessed value is made visible prior to all subsequent data accesses. The ordered check load (ld.c.clr.acq) instruction performs the same function as ld.c.clr. However, it is used to check the ALAT register to determine whether an ordered load was successful. If the ordered load has failed, this instruction initiates a reload to recover. The reload is performed with acquire semantics.
Load Hints
As mentioned earlier, the load hint completer specifies whether data being accessed in memory has temporal or nontemporal locality. It also indicates the levels of memory hierarchy that are affected by the access. Temporal locality and the memory hierarchy are based on the Itanium processor’s architectural view of data memory shown in Figure 5.1. This memory subsystem assumes three levels of cache memory between the general register file and main memory. Each level of cache memory is partitioned into two parallel structures: a temporal and a nontemporal structure. The existence of separate temporal and nontemporal structures and the number of levels of cache are specific to the processor’s implementation of the architecture. Figure 5.1 shows the cache model that is implemented for the Itanium processor.
Figure 5.1: Architectural View of Memory
Temporal structure cache memory is accessed with load instructions that contain a temporal locality hint. As the list of load hints in Table 5.5 shows, temporal locality at level 1 is the default: it is selected by not appending a load hint to the instruction mnemonic. On the other hand, specifying nontemporal locality with either the .nt1 or .nta hint accesses nontemporal structure cache memory. Locality hints do not affect a program’s functionality, but they do impact its performance during execution. A locality hint implies a particular allocation path in the memory hierarchy. This allocation path specifies the cache memory structures to allocate for the line containing the referenced data. The allocation path corresponding to each load hint is marked in Figure 5.1. An access that is temporal with respect to level 1 is treated as temporal with respect to all lower (higher-numbered) levels, as shown in the temporal, level 1 path in Figure 5.1. Hinting that an element of data should be cached in a temporal structure implies that it is likely to be read in the near future. An access that is nontemporal with respect to a given hierarchy level is treated as temporal with respect to all lower levels. For example, specifying the .nt1 hint for a data load indicates that the data should be cached at level 1, and that it is nontemporal at level 1 and temporal to levels 2 and 3. Subsequent loads for this memory location can be specified with the .nta hint and have the data remain at level 1. That is, if the line of data is already at the same or a higher level in the memory hierarchy, no movement occurs.
Load Fill Operation For the speculative load instruction, the value in a general register’s NaT bit plays an important role in the control speculation process. For this reason, you want to save this bit along with the value in its corresponding general register when saving to memory. This operation is known as a spill. The value of the NaT bit is actually saved in the user NaT collection (UNAT) register, which is an application register. The process of restoring the values of both the general register from memory and its NaT bit from the UNAT register is called a fill. A special form of the load instruction is used for reloading a spilled register and NaT pair. This is the load fill (ld8.fill) instruction and its general formats are shown in Table 5.1 as: (qp) ld8.fill.ldhint r1=[r3] (qp) ld8.fill.ldhint r1=[r3],r2 (qp) ld8.fill.ldhint r1=[r3],imm9 Notice that sz is always 8; therefore, a full 64-bit value is always loaded. In these three ld8.fill formats, the operand variations are identical to those already described for the general load instruction. The first format permits the loading of the value of the source operand in memory, which is pointed to by the base address in the general register specified by r3, into the destination register that is identified by r1. The other two formats permit a post increment of the base address in r3 by either the value in the general register specified by r2 or immediate operand imm9. When the 8-byte value is loaded into a destination register, the value in the corresponding bit in the UNAT application register is copied into the register’s NaT bit. Example What operation is performed by the following instruction? (p0) ld8.fill r20=[r10],8H Solution When this instruction is executed, the value held in memory at the address in Gr10 is copied into Gr20 and at the same time NaT20 is filled with the value in the bit location of the UNAT register selected by bits 8 through 3 of the memory address. 
After the data transfer is complete, the address in Gr10 is incremented by 8H.
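The bit selection used by the fill operation can be sketched in a few lines of Python. This is only an illustrative model, not processor code: the helper name unat_bit_index is hypothetical, and the sketch simply extracts bits 8 through 3 of the memory address as described in the solution above.

```python
def unat_bit_index(address: int) -> int:
    """Return the UNAT bit position selected by bits 8..3 of the address."""
    # Shifting right by 3 drops the byte offset inside an 8-byte slot;
    # masking with 0x3F keeps six bits, one value per UNAT bit (0..63).
    return (address >> 3) & 0x3F


# Consecutive 8-byte slots map to consecutive UNAT bits (base is arbitrary).
base = 0x1000
indices = [unat_bit_index(base + 8 * slot) for slot in range(4)]
```

Because only six address bits take part, the mapping repeats every 512 bytes, so two spill locations that are 512 bytes apart share the same UNAT bit in this model.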
Chapter 5 - Memory Access and Register Transfer Instructions Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
STORE INSTRUCTION
The store instruction saves to memory the data held in a general register. Store instructions can also be used to save information held in floating-point registers, but that type of operation is explained in Chapter 9. Table 5.6 identifies the base mnemonic and general formats of the store instruction. Just like the load operation, sz is appended to the store mnemonic to specify the size of the data transfer taking place. Stores are defined with the same access sizes as supported for loads. These sizes are listed in Table 5.2.

Table 5.6: Store Instruction Formats
Mnemonic   Operation   Format
stsz       Store       (qp) stsz.sttype.sthint [r3]=r2
                       (qp) stsz.sttype.sthint [r3]=r2,imm9
                       (qp) st8.spill.sthint [r3]=r2
                       (qp) st8.spill.sthint [r3]=r2,imm9
The two forms of the stsz instruction perform the reverse operation of their load counterparts. For instance, the register form is used to write the value in the general register specified by source operand r2 into the destination memory location pointed to by the base address in the general register identified by r3, as follows:
(qp) stsz.sttype.sthint [r3]=r2
If the NaT bit corresponding to either r2 or r3 is set to 1 when the instruction is executed, a register NaT consumption fault occurs. Both store type and store hint completers can be added to the base mnemonic of the store instruction. The supported sttype completers are listed in Table 5.7. A normal store operation is specified by not including a type completer with the store mnemonic. Notice that store operations are always nonspeculative. The only special type of store operation available is the ordered store and it is specified by using the release (.rel) completer. The ordered store (stsz.rel) instruction causes an ordered store of data into memory using release semantics.

Table 5.7: Store Type Completers
sttype   Meaning         Special store operation
None     Normal store
rel      Ordered store   Performed with release semantics
The different forms of the store instruction are shown in Table 5.8.

Table 5.8: Supported Memory Store Instructions
Mnemonic   Operation
st         Store
st.rel     Ordered store
st.spill   Spill store
Just as for the load instruction, the value of the sthint completer specifies the locality of the memory access. Table 5.5 shows that only "temporal locality, level 1" and "no temporal locality, all levels" apply to the store instructions, and that they are chosen by appending no completer or the .nta completer, respectively, to the instruction mnemonic. An example of a normal store instruction is:
(p0) st2 [r15]=r10
When executed, the least significant word of data in Gr10 is written to the storage location in memory pointed to by the address in Gr15. Since no store hint is provided, temporal locality, level 1, is selected for the store operation. The compiler makes this selection if it knows that the data is likely to be reloaded fairly soon. Looking at Figure 5.1, we see that this means that the value should be cached in the level 1 temporal structure of the cache memory and the allocation path is from register file through all of the temporal levels of the cache to main memory.
Store instructions can also specify the base-address-register update form of addressing introduced with the load instruction, but only by an immediate value. The second store instruction format in Table 5.6 implements this type of addressing. When this instruction is executed, the processor writes the value in the general register specified by r2 to the memory location pointed to by the address in the general register identified by r3. After the write operation is complete, the base address in r3 is post-incremented by the signed immediate value imm9. In this way, it points to a new storage location in memory.
Example
If the values in Gr15 and Gr10 are 10000000F0001238H and FFFFFFFF01234567H, respectively, what operation is performed by the following instruction? (p0) st8 [r15]=r10,8H Solution When the instruction is executed, the quad word FFFFFFFF01234567H in Gr10 is saved in memory at the address 10000000F0001238H that is held in Gr15. After the write to memory is complete, the value in Gr15 is post-incremented by 8H to give 10000000F0001240H. Now, Gr15 points to the next contiguous quad word storage location in memory.
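The arithmetic in this solution can be checked with a small Python model of the post-increment store. It is a sketch under stated assumptions rather than a description of the hardware: the function name st8_post_increment and the dictionary used as byte-addressed memory are hypothetical, and little-endian byte order is assumed.

```python
MASK64 = (1 << 64) - 1


def st8_post_increment(memory: dict, base: int, value: int, imm9: int) -> int:
    """Model st8 [r3]=r2,imm9: write 8 bytes, then return the updated base."""
    if not -256 <= imm9 <= 255:
        raise ValueError("imm9 must fit in a signed 9-bit field")
    # Assumption: little-endian layout, least significant byte at the base.
    for offset, byte in enumerate((value & MASK64).to_bytes(8, "little")):
        memory[(base + offset) & MASK64] = byte
    # The base register is updated only after the write has been modelled.
    return (base + imm9) & MASK64


memory = {}
gr15 = 0x10000000F0001238
gr10 = 0xFFFFFFFF01234567
gr15 = st8_post_increment(memory, gr15, gr10, 8)
```

After the call, gr15 holds 10000000F0001240H, which matches the post-incremented address given in the solution.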
A spill operation is used to save a general register and NaT bit pair. Just like for the fill operation, a special instruction is defined to perform this operation. The general format of this store instruction is given in Table 5.6 as: (qp) st8.spill.sthint [r3]=r2 Execution of the store spill instruction saves the value of the 8-byte source operand in the general register designated by r2 to the memory location pointed to by the address in the general register specified by r3, and the NaT bit corresponding to r2 into the UNAT register. The following instruction employs the second form of the store spill instruction given in Table 5.6: (p0) st8.spill [r10]=r20,8H It performs the same store operation and post-increments the address in general register Gr10 by the immediate value 8H.
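The pairing of spill and fill can be illustrated with a toy Python model. Everything here is a hedged sketch: the class name RegisterFile and its methods are hypothetical, memory is a dictionary of 8-byte slots, and the spill is assumed to choose its UNAT bit from bits 8 through 3 of the address, the same selection described earlier for the fill.

```python
class RegisterFile:
    """Toy model: register values, one NaT bit per register, a 64-bit UNAT."""

    def __init__(self):
        self.value = {}
        self.nat = {}
        self.unat = 0
        self.memory = {}

    @staticmethod
    def _unat_bit(address):
        # Assumption: bits 8..3 of the address select the UNAT bit.
        return (address >> 3) & 0x3F

    def st8_spill(self, address, reg):
        """Save the register value to memory and its NaT bit to UNAT."""
        self.memory[address] = self.value[reg]
        bit = self._unat_bit(address)
        if self.nat[reg]:
            self.unat |= 1 << bit
        else:
            self.unat &= ~(1 << bit)

    def ld8_fill(self, reg, address):
        """Restore the register value from memory and its NaT bit from UNAT."""
        self.value[reg] = self.memory[address]
        self.nat[reg] = bool((self.unat >> self._unat_bit(address)) & 1)


rf = RegisterFile()
rf.value[20], rf.nat[20] = 0x1234, True      # Gr20 carries a deferred exception
rf.st8_spill(0x2008, 20)                     # models st8.spill [r10]=r20
rf.value[20], rf.nat[20] = 0, False          # the register is reused meanwhile
rf.ld8_fill(20, 0x2008)                      # models ld8.fill r20=[r10]
```

The round trip restores both the value and the set NaT bit, which is why an ordinary st8 and ld8 pair would not be sufficient for saving a register that might hold a token.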
SPECULATION
Having been introduced to the load and store instructions and how they operate, you might be wondering how they apply to speculation. Since the loading of data from memory takes several clock cycles, load and store operations can delay the execution of instructions. For instance, suppose a load instruction is located just before a move, add, or AND instruction. If the instruction that follows the load instruction is supposed to use the data that is loading, a delay in processing occurs because the instruction must wait for the data load to be completed. This delay is referred to as memory latency.
To avoid potential delay, the compiler attempts to move load instructions earlier in the instruction stream, so the data can be loaded before it is needed for processing, thereby hiding the memory latency. The compiler analyzes the code for an application to determine whether or not the load instructions have a dependency. If a load is confirmed to have no dependencies, the compiler implements the instruction as a nonspeculative load and is free to move the instruction ahead in the program. You might remember from Chapter 4 that when a memory access instruction, such as the load instruction, has a dependency, it cannot be moved earlier in the stream of instructions. Therefore, memory dependencies are an important cause of memory latency.
So, the compiler for the Itanium processor uses the technique of speculation to reduce memory latency. Speculation may be used to overcome the dependency of the load instruction and allow it to be scheduled earlier in the instruction sequence. By moving the load instruction up in the instruction stream and executing it early, data is available in a register when needed for processing. Resolving load dependencies improves software performance by increasing parallelism and hiding latency.
Not all speculative loads are successful, and if the load is unsuccessful, the value of data is invalid. For this reason, the instruction set architecture includes two extra operations:
A check operation that confirms that the data is valid.
A recover operation that corrects for speculative loads that have failed.
Recovering from a failed speculative load has an adverse effect on performance. For this reason, the compiler only implements those load instructions that it knows will rarely experience a dependency as a speculative load.
Control Dependencies and Control Speculation
One type of load dependency is caused by a branch in program flow, called a control dependency. Figure 5.2 (a) shows an instruction sequence that demonstrates this situation. This sequence is known as a branch barrier condition. Notice that the load instruction is located just after the branch, and the information that is loaded is used immediately after the load takes place. If the instructions are executed in this sequence, execution will be delayed due to the memory latency from the load operation. The compiler for the Itanium processor uses a mechanism called control speculation to move a load instruction above a branch instruction. Figure 5.2 (b) shows how the compiler rearranges the instruction sequence to eliminate the branch barrier. Notice that the load from memory, which is performed with a speculative load (ld8.s) operation, is moved above the branch and might actually be performed in parallel with instructions 1 and 2 or even earlier. This change ensures that the data will be loaded into the appropriate register before it is needed, thereby hiding the latency of the memory load operation. In some cases, the instruction or instructions that use this data can also be moved above the branch instruction.
Figure 5.2: Branch Barrier Removal with Control Speculation
A speculation check (chk.s) instruction is placed in the instruction stream at the original location of the load instruction. This check operation confirms that the data is still valid just before it is to be used. If the data is found to be valid, the control speculation is successful and the instructions that follow the check instruction can process the data. On the other hand, if the check fails, control is passed to a recovery routine that reinitiates the load of data from memory. The check instruction does not take any additional clock cycles unless the data is found to be invalid.
The compiler may not perform this optimization technique on all occurrences of a branch barrier in an application program. It is used only when the compiler determines that an actual branch is highly unlikely, so the results produced by the instructions associated with the load operation are very likely needed. When the load instruction, and possibly its associated instructions, are moved above the branch, they are executed even though the execution flow might never reach this part of the program. In those cases where the calculations turn out not to be needed, their results are simply discarded.
So, how does the control speculation check process work? The speculation check instruction tests to determine whether a deferred exception has occurred during the execution of the load instruction in the control speculation instruction sequence. Exceptions associated with control-speculative loads are related to events, such as a page fault, and they actually are uncommon in correctly written code. As mentioned earlier, a control-speculative load does not raise a fault when an exception takes place during its execution. Instead, the exception is deferred. Also, the NaT bit associated with the destination general register in the speculative load instruction is set to 1 when a deferred exception occurs, indicating that the value of data held in the general register is not valid. A set NaT bit is called a token. Only those instructions that perform a speculative load operation can create a token.
During execution of the sequence of instructions being performed with control speculation, a set token is propagated through the instructions. When an instruction, such as a move, add, or AND instruction, in a speculative instruction sequence reads a source register whose NaT bit is set, the instruction passes the deferred exception token into the NaT bit for its destination register. Thus, a chain of instructions can be executed speculatively, and only the final result destination register need be checked for a deferred exception token to determine whether or not an exception has occurred.
As mentioned earlier, a speculation check (chk) instruction is placed at the end of the control speculation instruction sequence to determine whether or not it was successful. Table 5.9 shows that the control speculation form of the check instruction is coded with the .s completer. Looking at the format of the instruction, you can see that r2 identifies the destination general register of the last instruction in the sequence performed by control speculation.
If the token in the NaT register identified by r2 is 0, the results produced by the control speculative execution sequence are correct, and they are used to update the application state.

Table 5.9: Speculation Check Instruction Formats
Mnemonic   Operation   Format
chk        Check       (qp) chk.s r2, target25
                       (qp) chk.s.i r2, target25
                       (qp) chk.s.m r2, target25
                       (qp) chk.a.aclr r1, target25
If the NaT bit is set to 1, the execution of the speculative sequence has failed. In this case, the results do not update the application state; instead they are discarded. Then, the processor initiates a branch to a recovery routine. The starting point of the recovery routine is defined by the operand identified as target25. The value of target25 is encoded into the instruction as a signed immediate (imm21) operand, and it represents the displacement between the bundle containing the check instruction and the bundle that contains the first instruction of the recovery routine. The compiler can encode the control speculation check instruction to be executed by either an I-unit or an M-unit by adding the .i or .m completer, respectively.
The following instruction sequence illustrates a branch barrier that may be resolved by the use of control speculation.
(p1) br.cond some_label      // Cycle 0
     ld8 r1=[r5] ;;          // Cycle 1
     add r2=r1,r3            // Cycle 3
Notice that the load instruction that follows the branch has Gr1 as its destination register. The add instruction that uses the result of this load uses Gr1 as its source operand and has a RAW dependency with respect to the load instruction. For this reason, these three instructions must be in separate instruction groups, which is the reason they are shown as executed in different clock cycles. Moreover, the add operation is delayed by a clock period (cycle 3) due to the memory latency of the load operation. This code can be rewritten using a control-speculative load and speculation check, as follows.
     ld8.s r1=[r5] ;;        // Cycle -2
                             // Other instructions
(p1) br.cond some_label      // Cycle 0
     chk.s r1,recovery       // Cycle 0
     add r2=r1,r3            // Cycle 0
Here the load of register Gr1 has been moved up in the instruction stream where it is speculatively executed two clock periods before the branch instruction. A speculation check of Gr1 has been inserted before the add instruction. In this way, the data will be present in Gr1 before the add instruction is executed. The branch, speculation check, and add instructions are all shown as being executed in the same clock cycle. However, the results produced by the control speculation sequence do not update the application state unless two conditions occur: the branch to some_label is not taken, and the speculation check does not detect the occurrence of a deferred exception. If the branch is not taken, but a deferred exception is detected, the application state also is not updated and the check instruction passes program control to the service routine pointed to by the label recovery.
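The behavior of this control speculation sequence can be imitated with a short Python model. It is a simplified sketch, not the processor's actual mechanism: the names Reg, MemoryFault, ld8, ld8_s, add, and run are hypothetical, memory is a dictionary in which a missing address stands in for an access that would raise an exception, and the recovery routine is assumed to consist only of repeating the load nonspeculatively.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


class MemoryFault(Exception):
    """Stands in for a fault raised by a nonspeculative load."""


@dataclass
class Reg:
    value: int = 0
    nat: bool = False  # True means a deferred-exception token is present


def ld8(memory, address):
    """Nonspeculative load: a failed access raises the fault immediately."""
    if address not in memory:
        raise MemoryFault(hex(address))
    return Reg(memory[address], False)


def ld8_s(memory, address):
    """Speculative load: a failed access sets the NaT bit instead of faulting."""
    if address not in memory:
        return Reg(0, True)
    return Reg(memory[address], False)


def add(a, b):
    """The token propagates from either source into the destination."""
    return Reg((a.value + b.value) & MASK64, a.nat or b.nat)


def run(memory, r5, r3_value, branch_taken):
    r1 = ld8_s(memory, r5)          # ld8.s r1=[r5], hoisted above the branch
    if branch_taken:                # (p1) br.cond some_label
        return None                 # the speculative result is discarded
    if r1.nat:                      # chk.s r1,recovery
        r1 = ld8(memory, r5)        # recovery: repeat the load nonspeculatively
    r2 = add(r1, Reg(r3_value))     # add r2=r1,r3
    return r2.value
```

In this model a bad address causes a fault only when the branch is not taken, that is, only when the original program would also have executed the load.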
Data Dependencies and Data Speculation
The data dependency is a type of load dependency that is caused by ambiguous memory references. This potential conflict between two memory access instructions occurs when they could reference the same memory location. A sample instruction sequence is shown in Figure 5.3 (a). Here a load instruction follows a store instruction that might access the same location in memory. The wrong value of data could be loaded if the compiler moves the load operation above the store operation in the instruction sequence and the instructions do in fact access the same memory location. This condition is called a store barrier.
Figure 5.3: Store Barrier Removal with Data Speculation
Data speculation is the process the compiler uses to overcome the store barrier condition. Using data speculation, the compiler can schedule a data load from memory prior to a potentially conflicting store of data to memory. Figure 5.3 (b) shows how data speculation is implemented at a store barrier. By using the advanced load (ld8.a) instruction, the speculative data load takes place before the execution of the potentially conflicting store (st8) instruction. Again, the load actually might be performed in parallel with independent instructions 1 and 2, and it should happen earlier than the use, so that the data is available in the destination register when needed.
The check operation determines whether or not a conflict occurred between the speculative load and the intervening store, providing a mechanism to recover if the speculation failed. The check load (ld8.c) instruction is placed in the instruction stream at the original location of the load instruction. If the store has not affected the same memory location as the load, the speculative load is successful. Therefore, the data is valid and can be used by the instructions that follow the check load instruction. If the data speculation fails, the check load instruction reloads the data to recover. Remember, the check load instruction can be used to recover from a failed data speculation only when the load is the only dependent instruction moved above the store barrier.
As for control speculation, some of the dependent instructions that follow the load instruction possibly could be moved above the store barrier. In this case, the advanced check (chk.a) instruction must be used to perform the check and recovery operations. The format of this instruction is provided in Table 5.9. Just like the speculation check instruction, the advanced check instruction passes program control to a recovery routine located by immediate operand target25 if the data speculation fails.
When executed, the advanced load allocates an entry corresponding to the destination general register of the advanced load instruction in a structure called the advanced load address table (ALAT). The load address, the load type, and the size of the load are recorded at this location within the ALAT register. This ALAT entry is called a tag. If the store operation accesses the same address in memory, the tag is invalidated; otherwise, it remains valid.
Before the result of an advanced load can be used by any nonspeculative instructions, a check operation must be performed. The check load instruction must specify the same register number, address, and operand size as the corresponding advanced load. When executed, the check operation searches the ALAT for a matching entry. If the entry is found, an ALAT hit, the data speculation is successful. At this point, the ld.c.clr version of the instruction causes the tag to be cleared from the ALAT. The ld.c.nc instruction leaves the entry in the table. In either case, the value loaded into the destination register is valid. Therefore, the results of the data-speculative load are used to update the application state and processing continues with the instruction following the load check instruction. If a matching entry is not found in the table, an ALAT miss, the speculation is unsuccessful and the value loaded into the destination register is invalid. The results of speculative execution are not used to update the application state. Instead, these results must be reloaded. The check load instruction also reloads the data from memory. As for control speculation, the compiler may not choose to implement the data speculation optimization for all occurrences of the store barrier condition. The decision whether or not to
use data speculation depends on the probability that an intervening store will invalidate the advanced load, the impact on code size and performance of the recovery software, and the limited tag capacity of the ALAT register. For example, if an advanced load is issued and the ALAT contains no unused entries, the hardware could choose to invalidate an existing entry to make room for the new one. This action might cause a different data speculation operation to fail and the resulting overhead needed to recover could actually decrease performance. In this way, we see that the compiler’s decision to perform a nonspeculative to data-speculative code transformation is highly dependent upon the likelihood of the data speculation being unsuccessful and the cost of recovering from it.
In the nonspeculative instruction sequence that follows, the load instruction and store instruction, both identified for execution in clock cycle 0, could access conflicting memory addresses.
st8 [r4]=r12        // Cycle 0: ambiguous store
ld8 r6=[r8] ;;      // Cycle 0: load to advance
add r5=r6,r7 ;;     // Cycle 2
st8 [r18]=r5        // Cycle 3
If executed in the order shown, the result in Gr6 is not immediately available for processing by the add instruction. For this reason, the add operation is shown to be delayed by one clock cycle (cycle 2) to accommodate the memory latency of the load operation. Data speculation can resolve this ambiguous memory reference data dependency. One approach would be to just move the load operation above the store by using advanced load and check load operations, as shown in this data speculation code sequence:
ld8.a r6=[r8] ;;    // Cycle -2 or earlier
                    // Other instructions
st8 [r4]=r12        // Cycle 0: ambiguous store
ld8.c r6=[r8]       // Cycle 0: check load
add r5=r6,r7 ;;     // Cycle 0
st8 [r18]=r5        // Cycle 1
The original load has been turned into a check load, and an advanced load has been scheduled above the ambiguous store. In this way, the speculative load is performed several clock cycles (cycle -2) before the data that it loads into Gr6 are needed by the add instruction. If the speculation succeeds, the execution time of the remaining nonspeculative code is reduced because the latency of the load is hidden. Also, the compiler could move up both the load and one or more of the instructions that use its results. Remember that when this is done, a chk.a instruction rather than a ld.c instruction is needed to validate the advanced load operation. For our earlier example, the add instruction can also be moved above the store and executed speculatively, as shown in Figure 5.4. If this data-speculative load fails, much more software overhead is required to recover.
         ld8.a r6=[r8] ;;     // Cycle -2
                              // Other instructions
         add r5=r6,r7         // Cycle 0: add that uses r6
                              // Other instructions
         st8 [r4]=r12         // Cycle 0
         chk.a r6,recover     // Cycle 0: check
back:                         // Return point from jump to recover
         st8 [r18]=r5         // Cycle 0
                              // Recovery by reexecuting
                              // instructions nonspeculatively
recover: ld8 r6=[r8] ;;       // Reload r6 from [r8]
         add r5=r6,r7         // Reexecute the add
         br back              // Jump back to main code
Figure 5.4: Data Speculation Routine Involving Speculation Check and Recovery
When using advanced loads for data speculation, the compiler must decide on a case-by-case basis whether advancing only a load and using an ld.c is preferable to advancing both a load and its uses, which would require the use of the potentially more expensive chk.a.
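The ALAT bookkeeping behind the ld8.a, st8, and ld8.c sequence can be sketched as a small Python model. This is an illustration under stated assumptions, not the hardware design: the class Alat, its method names, the default capacity, and the oldest-first replacement policy are all hypothetical, and memory is modelled as a dictionary of aligned 8-byte slots.

```python
class Alat:
    """Toy ALAT: one entry per destination register, tagged by address and size."""

    def __init__(self, capacity=32):
        self.capacity = capacity   # arbitrary choice for this sketch
        self.entries = {}          # register number -> (address, size)

    def advanced_load(self, reg, address, size):
        """Record a tag for an advanced load, evicting the oldest if full."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))   # assumed replacement policy
            del self.entries[oldest]
        self.entries[reg] = (address, size)

    def store(self, address, size):
        """Invalidate every tag whose byte range overlaps the store."""
        for reg, (tag_address, tag_size) in list(self.entries.items()):
            if address < tag_address + tag_size and tag_address < address + size:
                del self.entries[reg]

    def check(self, reg, address, size, clear):
        """Return True on an ALAT hit; clear=True models the .clr completer."""
        hit = self.entries.get(reg) == (address, size)
        if hit and clear:
            del self.entries[reg]
        return hit


def run_sequence(memory, load_address, store_address, store_value):
    """Model ld8.a r6=[r8]; st8 [r4]=r12; ld8.c.clr r6=[r8]."""
    alat = Alat()
    r6 = memory[load_address]                   # ld8.a: early load
    alat.advanced_load(6, load_address, 8)
    memory[store_address] = store_value         # st8: ambiguous store
    alat.store(store_address, 8)
    reloaded = not alat.check(6, load_address, 8, clear=True)
    if reloaded:                                # ALAT miss: the check load reloads
        r6 = memory[load_address]
    return r6, reloaded
```

When the two addresses differ, the early value is kept and no reload happens; when they collide, the tag is gone by the time of the check and the value is loaded again, so the final content of r6 is correct in both cases.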
SEMAPHORE INSTRUCTIONS
A semaphore is a flag or a variable that is used by several consumers to arbitrate the use of one or more resources in a system. Consumers can mean threads, processes, or any other entity in the system. For example, let us assume that there are several applications running on an OS and that each application can request the service of a printer. Furthermore, let us assume that there is only one printer (resource) attached to this system. A semaphore can be used to keep track of the printer use so that the different applications don’t send their outputs to the printer simultaneously.
Semaphore instructions perform the following actions: load a general register from memory, perform an operation on the value in this register, then store the result of this operation to the same storage location in memory. The memory read and write operations are performed as a single indivisible operation, so no other access to that storage location can take place between them. This kind of memory access is called an atomic access. Three types of atomic semaphore operations are defined in the Itanium instruction set. They are the exchange, compare and exchange, and fetch and add operations. The semaphore instructions that perform these operations are always non-speculative.
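The printer scenario can be sketched in Python with an emulated atomic exchange. The sketch is illustrative only: the names Word, acquire, release, and run_print_jobs are hypothetical, a threading.Lock inside Word merely stands in for the atomicity that the hardware provides, and the values 0 (printer free) and 1 (printer busy) are an assumed convention.

```python
import threading
import time


class Word:
    """A memory word whose exchange is made indivisible by an internal lock."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def xchg(self, new_value):
        """Store new_value and return the previous content in one step."""
        with self._guard:
            old_value, self._value = self._value, new_value
        return old_value


def acquire(semaphore):
    # Keep writing 1 until the old value was 0, meaning the printer was free.
    while semaphore.xchg(1) == 1:
        time.sleep(0)   # let another thread run while waiting


def release(semaphore):
    semaphore.xchg(0)


def run_print_jobs(n_apps, jobs_each):
    semaphore = Word(0)
    log = []

    def application(app_id):
        for _ in range(jobs_each):
            acquire(semaphore)
            log.append(("start", app_id))   # only the owner reaches this point
            log.append(("end", app_id))
            release(semaphore)

    threads = [threading.Thread(target=application, args=(i,)) for i in range(n_apps)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

Because only the application that read back 0 owns the printer, every start entry in the log is directly followed by the end entry of the same application.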
Exchange Instruction
The exchange (xchg) instruction loads a destination general register with the zero-extended contents of a storage location in memory, then stores the value from a source general register in the same memory location. Table 5.10 shows its general form, as follows:
(qp) xchgsz.ldhint r1=[r3],r2
Like the load and store instructions, the sz completer attaches to the mnemonic so that the instruction can be written to process byte, word, double word, or quad word elements of data. Table 5.2 shows that they are selected by replacing sz with the value 1, 2, 4, or 8, respectively. The load hint (ldhint) completers that can be used to allocate a path through the cache memory are identical to those used with the load instruction. These completers and their meanings are listed in Table 5.5. Remember that if no load hint is specified, the selected operation is temporal locality, level 1. As shown in Figure 5.1, this selection indicates that the values accessed during execution of the instruction should be stored in level 1 of the cache memory subsystem and that the path between general registers and main memory is directly through the temporal structure of the cache.
The following is an example of the exchange instruction:
(p0) xchg4 r20=[r5],r10
When executed, the instruction reads from memory the double word value starting at the base address specified by the value in source register Gr5. Then, the least-significant 4 bytes of the value in source register Gr10 are written to memory, starting at the address specified by the value in Gr5. Finally, the value that was read from memory is zero-extended to 64 bits and placed in destination register Gr20, and the NaT bit corresponding to Gr20 is cleared.
Table 5.10: Semaphore Instruction Subgroup
Mnemonic    Operation                Format
xchgsz      Exchange                 (qp) xchgsz.ldhint r1=[r3],r2
cmpxchgsz   Compare and exchange     (qp) cmpxchgsz.sem.ldhint r1=[r3],r2,ar.ccv
fetchaddsz  Fetch and add immediate  (qp) fetchadd4.sem.ldhint r1=[r3],inc3
                                     (qp) fetchadd8.sem.ldhint r1=[r3],inc3
The read and write memory accesses that occur during the exchange operation are performed as a single atomic access, and they are always performed with acquire semantics. That is, the memory read/write is made visible prior to all subsequent data memory accesses. If the address specified by the value in source operand register r3 is not naturally aligned to the size of the value being accessed in memory, the processor issues an unaligned data reference fault, regardless of the value of the user mask alignment checking (UM.ac) bit in the user mask register.
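The register and memory effects of the exchange instruction can be sketched in a few lines of Python. This is only an illustrative model: the function name xchg, the dict that stands in for memory, and the Python exception that stands in for the fault are all assumptions, and the sketch does not model atomicity, memory ordering, or NaT bits.

```python
def xchg(memory, addr, src, sz):
    """Sketch of xchg<sz> r1=[r3],r2; returns the value placed in r1.

    memory is a hypothetical dict mapping an aligned address to the
    sz-byte value stored there; addr plays the role of r3, src of r2.
    """
    if sz not in (1, 2, 4, 8):
        raise ValueError("sz must be 1, 2, 4, or 8")
    if addr % sz != 0:
        # Stand-in for the unaligned data reference fault.
        raise MemoryError("unaligned data reference")
    width_mask = (1 << (8 * sz)) - 1
    old = memory.get(addr, 0) & width_mask   # value read from memory
    memory[addr] = src & width_mask          # least-significant sz bytes of r2
    return old                               # non-negative, so zero-extended


# Mirrors (p0) xchg4 r20=[r5],r10 with made-up register contents.
mem = {0x1000: 0xAABBCCDD}
r20 = xchg(mem, 0x1000, 0x1122334455667788, 4)
print(hex(r20), hex(mem[0x1000]))
```

Only the low 4 bytes of the source value reach memory, while the destination receives the previous memory contents.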
Compare and Exchange Instruction
The compare and exchange operation is similar to the exchange operation just described, except that a compare operation is performed to decide whether or not the exchange should take place. This instruction is useful when data need to be exchanged between two variables conditionally. For instance, the application programmer can use a compare and exchange operation to conditionally read-modify-write a semaphore. Table 5.10 shows the general format of the compare and exchange (cmpxchg) instruction as:
(qp) cmpxchgsz.sem.ldhint r1=[r3],r2,ar.ccv
The values of sz and the ldhint completers are identical to those used with the exchange instruction. However, a new completer called semaphore type (.sem) is appended to this instruction mnemonic. As shown in Table 5.11, the sem completer specifies whether acquire or release semantics are used for the memory transfers that are performed as part of the semaphore operation.
Table 5.11: Semaphore Type Completers
sem  Ordering semantics  Semaphore operation
acq  Acquire             The memory read/write is made visible prior to all subsequent data memory accesses
rel  Release             The memory read/write is made visible after all previous data memory accesses
When cmpxchg is executed, a value consisting of sz bytes is read from memory, starting at the address indicated by the value in source register r3. This value is zero-extended and compared with the contents of the compare and exchange compare value register (Ar[CCV]), which is an application register. If these two values are equal, the least significant sz bytes of the value in source register r2 are written to memory starting at the address specified by the value in r3. If they are not equal, memory is left unchanged. In either case, the zero-extended value read from memory is placed in destination register r1 and the NaT bit associated with r1 is cleared. As with the exchange instruction, if the specified memory address is not naturally aligned to the size of the data being accessed, an unaligned data reference fault occurs. Moreover, the read and write data transfers are performed as an atomic memory access.
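The compare and exchange behavior described above can be modeled with a short Python sketch. The names cmpxchg and try_acquire, the dict-based memory, and the lock convention (0 means free, 1 means held) are assumptions made for illustration; the model ignores atomicity, the .acq and .rel ordering semantics, and NaT bits.

```python
MASK64 = (1 << 64) - 1


def cmpxchg(memory, addr, new, ccv, sz):
    """Sketch of cmpxchg<sz> r1=[r3],r2,ar.ccv; returns the value for r1."""
    if sz not in (1, 2, 4, 8):
        raise ValueError("sz must be 1, 2, 4, or 8")
    if addr % sz != 0:
        # Stand-in for the unaligned data reference fault.
        raise MemoryError("unaligned data reference")
    width_mask = (1 << (8 * sz)) - 1
    old = memory.get(addr, 0) & width_mask    # zero-extended value from memory
    if old == (ccv & MASK64):                 # compare with all 64 bits of Ar[CCV]
        memory[addr] = new & width_mask       # exchange only on a match
    return old                                # r1 gets the old value either way


def try_acquire(memory, lock_addr):
    """Hypothetical lock protocol: succeed only if the lock word was 0."""
    return cmpxchg(memory, lock_addr, new=1, ccv=0, sz=8) == 0


mem = {0x2000: 0}
first = try_acquire(mem, 0x2000)    # lock was free, so it is taken
second = try_acquire(mem, 0x2000)   # lock is already held, memory unchanged
print(first, second, mem[0x2000])
```

Because the old value is returned whether or not the store happens, the caller can tell from r1 alone whether the exchange took place.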
Fetch and Add Immediate Instruction
The fetch and add immediate (fetchadd) instruction reads a value from memory into a general register, adds an immediate operand to that value, and writes the sum back to the same memory location. In Table 5.10, we see that its first format is:
(qp) fetchadd4.sem.ldhint r1=[r3],inc3
This form of the instruction only processes 4-byte (double word) data. The source operand register is identified as r3, the destination operand as r1, and the immediate operand as inc3. Values that may be specified by inc3 are limited to –16, –8, –4, –1, 1, 4, 8, or 16. Table 5.10 shows a second form of the instruction that operates on quad word data.
When executed, the value in the memory location addressed by the pointer in the general register specified as the source operand (r3) is read. Next, the sign-extended value of the immediate operand is added to the value read, and the sum is stored into the memory location pointed to by the address in r3. The value that was read from memory, before the addition, is zero-extended and placed in the destination register (r1). Finally, the NaT bit corresponding to destination register r1 is cleared. At completion of the instruction, the storage location in memory contains the incremented value, while the destination general register contains the original value.
The fetchadd instruction is used commonly in algorithms using semaphores. The semaphore instructions are not privileged, so application programmers can use them, for example to update shared lock variables. However, these instructions are more commonly used in operating systems and device drivers.
The values and operations performed by the ldhint and sem completers are identical to those used with the compare and exchange instruction. Furthermore, memory read and write accesses are guaranteed to be atomic, and if the memory address does not align with the size of the value being accessed in memory, an unaligned data reference fault takes place.
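A small Python model can make the fetch and add data flow concrete. In the Itanium architecture the destination register receives the value read from memory, and memory receives that value plus the increment. The function name fetchadd, the dict-based memory, and the sample addresses are assumptions for illustration; atomicity, ordering semantics, and NaT bits are not modeled.

```python
ALLOWED_INC = (-16, -8, -4, -1, 1, 4, 8, 16)


def fetchadd(memory, addr, inc, sz):
    """Sketch of fetchadd<sz> r1=[r3],inc3; returns the value placed in r1."""
    if sz not in (4, 8):
        raise ValueError("fetchadd supports only 4-byte and 8-byte data")
    if inc not in ALLOWED_INC:
        raise ValueError("inc3 must be one of %r" % (ALLOWED_INC,))
    if addr % sz != 0:
        # Stand-in for the unaligned data reference fault.
        raise MemoryError("unaligned data reference")
    width_mask = (1 << (8 * sz)) - 1
    old = memory.get(addr, 0) & width_mask
    memory[addr] = (old + inc) & width_mask   # sum wraps at the data size
    return old                                # value fetched before the add


mem = {0x3000: 5, 0x3008: 0xFFFFFFFF}
ticket = fetchadd(mem, 0x3000, 1, 8)     # a shared counter handing out tickets
wrapped = fetchadd(mem, 0x3008, 1, 4)    # 32-bit value wraps around to 0
print(ticket, mem[0x3000], hex(wrapped), mem[0x3008])
```

Returning the pre-increment value is what makes the instruction useful for counters and ticket-style locks: each caller obtains a distinct value even when several callers update the same location.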
REGISTER TRANSFER INSTRUCTIONS
Earlier in this chapter, you learned that the register operand in a memory access instruction must be in the general or floating-point register files. Therefore, only these two register files can be loaded with data or address information from memory. Application code must use the general register file as an intermediary to initialize the registers in the predicate, branch, performance monitoring, and application register files or save their information in memory. For instance, to initialize a branch register with an address, the value of the address must first be loaded from memory to a general register, then moved to the appropriate branch register. Moreover, to save the state of the user mask in memory, its contents must be moved first to a general register, then stored in memory.
The following sections define the instructions to move information between the general register file and the other register files. These instructions are called the register transfer instructions. The mnemonics and formats of these instructions are shown in Table 5.12. Many of the move instructions are actually pseudo-operations of the add instruction. That is, when the programmer specifies the move instruction, the compiler actually replaces it with a version of an add instruction that implements the same operation. The move instructions used to load the floating-point registers are introduced in Chapter 9, which is dedicated to floating-point architecture and operation.
THE MOVE INSTRUCTIONS
The move instructions transfer information between the separate register files. Using operands or completers, variations of the move instruction are defined that transfer information between:
- general registers
- a general register and the instruction pointer
- a general register and the branch, predicate, processor identification, performance monitoring, or application registers
For instance, the formats in Table 5.12 show that the instruction pointer form of the move instruction is created by identifying the instruction pointer (ip) register as the source operand. On the other hand, the return form of the move branch register instruction is marked with the return (.ret) completer.
Table 5.12: Register Transfer Instructions
Mnemonic  Operation                  Format
mov       General register move      (qp) mov r1=r3
          Move indirect register     (qp) mov r1=ireg[r3]
          Move instruction pointer   (qp) mov r1=ip
          Move branch register       (qp) mov r1=b2
                                     (qp) mov b1=r2
                                     (qp) mov.ret b1=r2
          Move predicates            (qp) mov r1=pr
                                     (qp) mov pr=r2,mask17
                                     (qp) mov pr.rot=imm44
          Move application register  (qp) mov r1=ar3
                                     (qp) mov ar3=r2
                                     (qp) mov ar3=imm8
                                     (qp) mov.i r1=ar3
                                     (qp) mov.i ar3=r2
                                     (qp) mov.i ar3=imm8
                                     (qp) mov.m r1=ar3
                                     (qp) mov.m ar3=r2
                                     (qp) mov.m ar3=imm8
          Move user mask             (qp) mov r1=psr.um
                                     (qp) mov psr.um=r2
sum       Set user mask              (qp) sum imm24
rum       Reset user mask            (qp) rum imm24
General Register Move Instruction
How is the general register to general register move operation performed? This operation is implemented with the general register move instruction, which is written as follows:
(qp) mov r1=r3
When executed, the instruction copies the value of the source operand in the general register identified by r3 to the destination general register identified by r1. The following is an example of the instruction:
(p0) mov r10=r127
Example
What happens when the following instruction is executed?
(p0) mov r127=r0
Solution
Remember that Gr0 is a special register. It is read-only and has its bits all hardwired to logic 0. Therefore, executing this instruction clears all bits in Gr127.
Move Indirect Instruction
The move indirect register instruction performs an indexed read access of a processor identification register or performance monitoring data register. Table 5.12 shows that the general format of this instruction is as follows:
(qp) mov r1=ireg[r3]
Here the source operand is identified as ireg[r3]. The first part of this operand, ireg, gives the name of the register file to be indirectly accessed. Table 5.13 shows that the two possible values are cpuid and pmd. The general register specified by r3 holds the value used to index this file of registers. Bits 0 through 7 of r3 are used as the index; the rest of the bits are ignored. Remember, the five processor identification registers are CPUID registers 0 through 4.
Table 5.13: Indirect Register File Mnemonics
ireg   Register file
cpuid  Processor identification register
pmd    Performance monitor data register
As an example, here's how to write an instruction that will read the processor identifier from CPUID register 3 into Gr10. Assume that register Gr5 is used to hold the index. Since CPUID register 3 holds the identifier information of the processor, ireg must equal cpuid and Gr5 must contain the value 3H. Assuming that execution of the instruction is not conditional on any predicate bit, the instruction is written as:
(p0) mov r10=cpuid[r5]
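The way the index is taken from r3 can be shown with a short Python sketch. The function name, the list that stands in for the CPUID register file, and the register contents are placeholders invented for illustration; they are not real processor values, and the IndexError is only a stand-in for whatever fault the processor raises for an unimplemented register.

```python
def mov_from_indirect(reg_file, r3):
    """Sketch of mov r1=ireg[r3]: only bits 0 through 7 of r3 form the index."""
    index = r3 & 0xFF
    if index >= len(reg_file):
        raise IndexError("register %d is not implemented in this sketch" % index)
    return reg_file[index]


# Placeholder contents for CPUID registers 0 through 4 (not real values).
cpuid = [0x10, 0x11, 0x12, 0x13, 0x14]

r5 = 0xABCD0003                       # upper bits are ignored by the instruction
r10 = mov_from_indirect(cpuid, r5)    # reads CPUID register 3
print(hex(r10))
```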
Move Instruction Pointer Instruction
The next instruction in Table 5.12 copies the value of the instruction pointer to a general register. This is the move instruction pointer instruction and its general format is:
(qp) mov r1=ip
Here ip stands for the instruction pointer register and r1 for the general-purpose register into which the value of ip is copied. For instance, to copy the value of the instruction pointer into Gr5, the instruction is written as:
(p0) mov r5=ip
When this instruction is executed, the address of the bundle that contains this instruction is copied into destination register Gr5.
Move Branch Register Instruction
The ISA also provides a move instruction to transfer addresses between the branch registers and the general registers. As shown in Table 5.12, the move branch register instruction has three formats. The first format copies an address from one of the branch registers, identified in general as b2, to one of the general registers identified as r1. For example, the instruction that follows copies the value of the address in BR5 to general register Gr10.
(p0) mov r10=b5
When the copy takes place, the NaT bit corresponding to destination general register r1 is cleared.
The second format performs the reverse operation. That is, it copies an address from a general register into a branch register. For instance, this instruction loads branch register 5 with an address from general register 10.
(p0) mov b5=r10
This move instruction differs from those already described in that its operation is affected by the setting of the NaT bit associated with a general register. If the NaT bit corresponding to Gr10 is 1 when this instruction is executed, then a register NaT consumption fault occurs.
The last format of the move branch register instruction includes the return (.ret) completer. This form of the instruction is used to provide hints to a future or downstream return-type branch.
Move Predicates Instruction
Remember, the Itanium processor's application state has a predicate register file that consists of the 64 1-bit-wide registers Pr0 through Pr63. The move predicates instruction transfers information between these registers and a general register. The general formats of this instruction, given in Table 5.12, permit three different operations to be performed:
- copying the values in all 64 predicate registers into a general register
- loading select predicate registers with values from a general register
- loading the rotating predicate registers with values from an immediate operand
The first form of the move predicate instruction shown in Table 5.12 is used to copy the values in all 64 predicate registers into a destination general register identified by r1. The relationship between a predicate register value and its corresponding bit position in the general register is illustrated in Figure 5.5.
Figure 5.5: Relationship Between Predicate Registers and Bit Locations
Notice that Pr0 is copied into the least-significant bit of the general register, and so on up through Pr63, which is in the most-significant bit position. For example, the following instruction copies the entire predicate register file into Gr10.
(p0) mov r10=pr
Remember that the value in Pr0 is hardwired as 1. In Chapter 3, the predicate registers were partitioned into a static group and a rotating group. The lower 16 registers, Pr0 through Pr15, are the static predicate registers and the upper 48 registers, Pr16 through Pr63, are the rotating predicate registers.
The second form of the instruction is used to copy values from a source operand r2, a general register, to select predicate registers. A second source operand, immediate operand mask17, tells which of the predicate registers are updated when the instruction is executed. A value of 1 in a mask bit position means that the value in the corresponding predicate register is updated, and 0 means that it is not modified. Even though the mask has just 17 bits, it affects all 64 predicate registers because the value is sign-extended before it is used to mask the bits of r2. In this way, we see that the values in the lower 16 mask bits correspond to the individual static predicate registers and the value of the 17th bit, which is the sign bit, is used for all of the 48 rotating registers. An example is the following instruction:
(p0) mov pr=r10,10016H
Expressing the mask in binary form gives 10000000000010110₂, and sign-extending it to 64 bits results in the following value:
1111111111111111111111111111111111111111111111110000000000010110₂
Therefore, when the instruction is executed, the values in Pr1, Pr2, Pr4, and Pr16 through Pr63 are updated from the corresponding bit positions in Gr10.
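The mask expansion in this example can be checked with a short Python sketch. The function names and the use of a single 64-bit integer to hold the predicate file (bit n for Prn) are modeling assumptions; NaT checking on r2 is not modeled.

```python
MASK64 = (1 << 64) - 1


def expand_mask17(mask17):
    """Sign-extend the 17-bit mask to 64 bits (bit 16 covers Pr16 to Pr63)."""
    mask17 &= (1 << 17) - 1
    if mask17 & (1 << 16):
        mask17 |= MASK64 & ~((1 << 17) - 1)
    return mask17


def mov_to_pr(pr, r2, mask17):
    """Sketch of mov pr=r2,mask17; pr and the result are 64-bit integers."""
    mask = expand_mask17(mask17)
    updated = (pr & ~mask) | (r2 & mask)   # copy only the selected bits of r2
    return (updated | 1) & MASK64          # Pr0 always reads as 1


mask = expand_mask17(0x10016)
new_pr = mov_to_pr(pr=0b1001, r2=MASK64, mask17=0x10016)
print(hex(mask), hex(new_pr))
```

In the usage lines, Pr3 keeps its old value because its mask bit is 0, while Pr1, Pr2, Pr4, and Pr16 through Pr63 take their values from r2.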
Loading of the predicate registers is also affected by the state of the NaT bit for r2. If NaT is cleared to 0, the instruction executes. On the other hand, if it is set to 1, a register NaT consumption fault results.
The last form of the move predicate instruction affects only the 48 rotating predicates. For this reason, the .rot completer is added to the instruction mnemonic. The source operand is taken from the imm44 operand, which is encoded in the instruction as a 28-bit value. The values of the lower 27 individual bits correspond to Pr16 through Pr42, respectively, and the value of the most significant bit represents Pr43 through Pr63.
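The rotating form can be sketched the same way. In this hedged Python model, imm44 is treated as a 44-bit value whose bits 16 through 42 map to Pr16 through Pr42 and whose bit 43 is replicated into Pr43 through Pr63, as the text describes; the function name and the integer representation of the predicate file are assumptions.

```python
MASK64 = (1 << 64) - 1
ROTATING = MASK64 & ~0xFFFF      # bit positions of Pr16 through Pr63


def mov_to_pr_rot(pr, imm44):
    """Sketch of mov pr.rot=imm44; static predicates Pr0 to Pr15 are kept."""
    value = imm44 & ((1 << 44) - 1)
    if value & (1 << 43):                      # bit 43 stands for Pr43 to Pr63
        value |= MASK64 & ~((1 << 44) - 1)
    return (pr & 0xFFFF) | (value & ROTATING)


# Set Pr16 and Pr43 through Pr63, clear Pr17 through Pr42.
new_pr = mov_to_pr_rot(pr=MASK64, imm44=(1 << 43) | (1 << 16))
print(hex(new_pr))
```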
Move Application Register Instruction
The last form of the move instruction in this section is the move application register instruction. Formats are given in Table 5.12 for three basic forms of this instruction. They perform the following actions:
- save the content of an application register to a general register
- load an application register from a general register
- load an application register with an immediate operand
Certain application register accesses must be performed by the integer execution (I) unit and others by the memory execution (M) unit. Table 5.14 shows the application registers that are associated with the I-unit and the M-unit. For this reason, each of the move application register instruction formats is repeated in Table 5.12 with a memory (.m) and integer (.i) completer. An access to an application register with the wrong execution unit type causes an illegal operation fault.
The first form of the instruction is used to copy the content of the specified application register to a general register:
(qp) mov r1=ar3
The second form performs the reverse operation. It copies the value from the specified general register into the application register. For example, in Table 5.14, read and write operations to Ar0 must be performed by the memory execution unit. Therefore, the instruction needed to save its value in Gr5 is written as follows:
(p0) mov.m r5=ar0
When executed, the value in Ar0 is copied into Gr5 and the corresponding NaT bit is cleared. Ar0 can be reloaded from Gr5 with the instruction:
(p0) mov.m ar0=r5
If the NaT bit corresponding to Gr5 is set when this instruction is executed, a register NaT consumption fault results.
The source operand in the third form of the move application register instruction is immediate operand imm8. The value of this 8-bit operand is sign-extended to 64 bits before it is placed into the specified application register.
Table 5.14: Execution Unit Required to Access Application Registers
Register      Description                                  Execution unit type
Ar0–Ar7       Kernel registers 0 through 7                 M
Ar8–Ar15      Reserved                                     M
Ar16          Register stack configuration register        M
Ar17          Backing store pointer                        M
Ar18          Backing store pointer for memory stores      M
Ar19          RSE NaT collection register                  M
Ar20          Reserved                                     M
Ar21          IA-32 compatibility                          M
Ar22–Ar23     Reserved                                     M
Ar24          IA-32 compatibility                          M
Ar25          IA-32 compatibility                          M
Ar26          IA-32 compatibility                          M
Ar27          IA-32 compatibility                          M
Ar28          IA-32 compatibility                          M
Ar29          IA-32 compatibility                          M
Ar30          IA-32 compatibility                          M
Ar31          Reserved                                     M
Ar32          Compare and exchange compare value register  M
Ar33–Ar35     Reserved                                     M
Ar36          User NaT collection register                 M
Ar37–Ar39     Reserved                                     M
Ar40          Floating-point status register               M
Ar41–Ar43     Reserved                                     M
Ar44          Interval time counter                        M
Ar45–Ar47     Reserved                                     M
Ar48–Ar63     Ignored                                      M or I
Ar64          Previous function state                      I
Ar65          Loop count register                          I
Ar66          Epilog count register                        I
Ar67–Ar111    Reserved                                     I
Ar112–Ar127   Ignored                                      M or I
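The unit assignments in Table 5.14 follow a simple pattern by register number, which the following Python sketch captures. The function names and the string encoding of the unit types are hypothetical; the ranges are taken directly from the table.

```python
def required_unit(ar):
    """Return "M", "I", or "MI" for application register number ar."""
    if not 0 <= ar <= 127:
        raise ValueError("application registers are Ar0 through Ar127")
    if 48 <= ar <= 63 or 112 <= ar <= 127:
        return "MI"                  # ignored registers: either unit works
    return "M" if ar < 64 else "I"


def unit_allowed(completer, ar):
    """True if mov.<completer> may access Ar<ar>; completer is "m" or "i"."""
    return completer.upper() in required_unit(ar)


print(unit_allowed("m", 0))    # kernel register 0 through the M-unit
print(unit_allowed("i", 0))    # wrong unit: would be an illegal operation fault
print(unit_allowed("i", 65))   # loop count register through the I-unit
```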
USER MASK INSTRUCTIONS
In Chapter 3, the user mask was described as part of the processor status register. In fact, it is the six least-significant bits of this register and it is denoted as PSR{5:0}. The user mask contains bits that either select operating options or monitor instruction execution. Operating options might be little-endian or big-endian data organization, and enabling or disabling of alignment check for data memory references. Three instructions are provided to permit the saving or loading of the value in the user mask, ORing its content with an immediate operand, and ANDing its content with an immediate operand. Table 5.12 gives their formats.
Move User Mask
The first user mask instruction is actually another form of the move instruction. This instruction, move user mask, either saves the content of the user mask part of the processor status register in a general register or loads it from a general register. To save the bits of the user mask by copying them to a general register, this general form of the instruction is used:
(qp) mov r1=psr.um
Note that the user mask is identified as an operand in instructions with the notation psr.um. When this instruction is executed, the 6-bit value psr.um is zero-extended to 64 bits and then saved in the general register specified as the destination operand r1.
The other form of this instruction is used to load a value into the user mask from source general register r2. For instance, executing the following instruction causes the six least-significant bits of general register 10, Gr10{5:0}, to be written into the user mask, PSR{5:0}.
(p0) mov psr.um=r10
Set User Mask Instruction
The set user mask (sum) instruction selectively sets bits in psr.um to 1. As shown in Table 5.12, the source operand is the 24-bit immediate operand imm24. When the instruction is executed, the user mask (PSR{5:0}) is read, ORed with imm24, and the result produced by the OR operation is placed back in the user mask. The user performance monitor (psr.up) bit of the user mask can be set only if the value of the secure performance monitor (psr.sp) bit in the processor status register is 0. Otherwise, the value in psr.up is not modified.
Example
Assume that the original value in the user mask is 001000₂. What is the effect of executing the following instruction?
(p0) sum 02H
Solution
When the instruction is executed, the current value in psr.um is read as:
PSR{5:0} = 001000₂
This value is ORed with the six least-significant bits of the immediate operand, 000010₂, to give this value:
001000₂ OR 000010₂ = 001010₂
Therefore, the new content of psr.um is 001010₂. This shows that the sum instruction has set the be bit in the user mask to 1 and selected big-endian data organization.
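The OR operation in this example can be reproduced with a few lines of Python. The function name and parameters are illustrative assumptions. The sketch places psr.up at bit 2 of the user mask, which follows the Itanium PSR layout but is not stated in this section, and it simply ignores immediate bits above bit 5 rather than modeling any fault.

```python
UM_MASK = 0x3F        # the user mask is PSR{5:0}
UP_BIT = 1 << 2       # assumed position of psr.up within the user mask


def sum_um(um, imm24, psr_sp=0):
    """Sketch of sum imm24; returns the new 6-bit user mask."""
    bits = imm24 & UM_MASK
    if psr_sp:
        bits &= ~UP_BIT          # psr.up is left unmodified when psr.sp is 1
    return (um | bits) & UM_MASK


new_um = sum_um(0b001000, 0x02)
print(format(new_um, "06b"))
```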
Reset User Mask Instruction
The last instruction in Table 5.12, reset user mask (rum), is similar to set user mask (sum) except that it is used to clear bits in the user mask to 0. It also has a 24-bit immediate operand. However, the operation it performs is an AND operation on the value in psr.um. Execution of the rum instruction causes the complement of the imm24 operand to be ANDed with the user mask (PSR{5:0}), and the result is placed back in the user mask. Again, the psr.sp bit must be 0; otherwise psr.up is not modified.
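A matching Python sketch for rum shows the AND with the complemented immediate. As before, the function name is hypothetical, psr.up is assumed to sit at bit 2 of the user mask, and immediate bits above bit 5 are ignored instead of being checked.

```python
UM_MASK = 0x3F        # the user mask is PSR{5:0}
UP_BIT = 1 << 2       # assumed position of psr.up within the user mask


def rum_um(um, imm24, psr_sp=0):
    """Sketch of rum imm24; returns the new 6-bit user mask."""
    bits = imm24 & UM_MASK
    if psr_sp:
        bits &= ~UP_BIT          # psr.up is left unmodified when psr.sp is 1
    return um & ~bits & UM_MASK


# Clearing the be bit (02H) undoes the sum example: 001010 becomes 001000.
new_um = rum_um(0b001010, 0x02)
print(format(new_um, "06b"))
```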
Chapter 6 - Integer Computation, Character String, and Population Count Instructions Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Chapter 6: Integer Computation, Character String, and Population Count Instructions
The previous chapter covered the Itanium processor's memory access and data transfer instruction groupings. This chapter covers the integer computation, character string, and population count instruction groupings. The following sections discuss the operation of each instruction and explain their use in some basic applications.
INTEGER COMPUTATION INSTRUCTION GROUPING
The integer execution units (I-units) provide a set of instructions to perform traditional software operations, such as arithmetic computations, logical decisions, shifts, bit-field manipulation, and acceleration of 32-bit data and pointer operations. To simplify the explanation of these operations, similar integer computation instructions are grouped together to form the following five categories:
- Integer arithmetic instructions
- Logical instructions
- Large constant generating instructions
- 32-bit integer and 32-bit address pointer instructions
- Shift and bit-field instructions
Either the I-unit or M-unit can actually execute the arithmetic, logical, and 32-bit acceleration instructions.
INTEGER ARITHMETIC INSTRUCTIONS
The integer arithmetic instruction subgroup has three base instruction types: add, subtract, and shift left and add. Table 6.1 identifies these instructions and gives their general formats. The Itanium processor's instruction set does not provide separate instructions to perform integer multiply and divide operations. However, the floating-point unit performs integer multiplication, division, and remainder operations. For this reason, these operations are introduced in a later chapter that covers the floating-point architecture.
Add Instructions
Depending on the type of operation performed and the operands selected, five different forms of the add instruction are available. The first form shown in Table 6.1 is:
(qp) add r1=r2,r3
Table 6.1: Integer Arithmetic Instructions
Mnemonic  Operation           Format
add       Add                 (qp) add r1=r2,r3
                              (qp) add r1=r2,r3,1
                              (qp) adds r1=imm14,r3
                              (qp) addl r1=imm22,r3
                              (qp) add r1=imm,r3
sub       Subtract            (qp) sub r1=r2,r3
                              (qp) sub r1=r2,r3,1
                              (qp) sub r1=imm8,r3
shladd    Shift left and add  (qp) shladd r1=r2,count2,r3
Known as the register form, this instruction adds the signed integer numbers in source registers r2 and r3, if qualifying predicate qp is True. Their sum is placed in destination register r1. Expressed arithmetically, the operation is:
Gr1 = Gr2 + Gr3
Remember that qp is a general representation that stands for any one of the 64 1-bit predicate registers Pr0 through Pr63 in the application register set. The explanation in Chapter 3 established that Pr0 is always True (1). Choosing Pr0 for qp makes the execution of the instruction nonconditional.
Next, look at the selection of the registers. In this format, the destination is identified as r1 and the source registers as r2 and r3. Again, this generalization means first general register, second general register, third general register, not that registers Gr1, Gr2, and Gr3 must always be used. Any of the 128 general registers can be selected as the locations of the source or destination operands. Also remember that register Gr0 always contains 0H. Finally, as pointed out in Chapter 4, no completers modify the base add instruction.
The second form of the add instruction shown in Table 6.1 is similar to the base instruction just described, except for its third source operand, which is always 1H. Called the plus 1 register form, it can be used to perform 2's complement arithmetic operations. The operation performed by this instruction can be viewed as an add with carry. The add with carry function is needed when adding more than two data operands.
Example
If the values in Gr3 and Gr4 are FFFFFFFF00000000H and 0000000012345678H, what result is produced by executing the following instruction?
(p0) add r5=r4,r3,1
Solution
The addition performed by the instruction is as follows:
Gr5 = Gr4 + Gr3 + 1 = 0000000012345678H + FFFFFFFF00000000H + 1H = FFFFFFFF12345679H
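The register and plus 1 forms can be checked numerically with a small Python sketch. The function names are illustrative assumptions, and the model covers only 64-bit wraparound, not predication or NaT bits. The second helper shows one way the plus 1 form supports 2's complement arithmetic: a difference a – b equals a plus the bitwise complement of b plus 1.

```python
MASK64 = (1 << 64) - 1


def add(r2, r3, plus_one=False):
    """Sketch of add r1=r2,r3 and add r1=r2,r3,1 on 64-bit registers."""
    return (r2 + r3 + (1 if plus_one else 0)) & MASK64


def sub_via_add(a, b):
    """Compute a - b as a + ~b + 1, using the plus 1 register form."""
    return add(a, ~b & MASK64, plus_one=True)


gr5 = add(0x0000000012345678, 0xFFFFFFFF00000000, plus_one=True)
print(hex(gr5))
print(sub_via_add(10, 3))
```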
The next three forms of the add instruction shown in Table 6.1 add an immediate operand to a source operand located in a register. The first form, which is called the imm14 form and identified by the mnemonic adds, adds a 14-bit immediate operand (a 13-bit value plus a sign bit) to any of the general purpose registers. The second form, the imm22 form (addl), adds a 22-bit immediate operand (sign bit plus 21-bit value) to the content of a general register. In this case, the selection of the general register for the other source operand is limited to Gr0 through Gr3. In both forms, the value of the immediate operand is encoded into the instruction. During execution, the value of the immediate operand is sign-extended to 64 bits.
The last immediate add instruction form is actually a pseudo-operation for the two earlier immediate add instructions. That is, it can be used in place of either of them in a program. Notice that it does not specify the size of the immediate operand. When this version is used, the compiler looks at both the size of the specified immediate operand and the register selected as source operand r3. Based on these, the compiler selects either the imm14 or imm22 format for coding of the instruction.
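The choice between the two immediate forms can be sketched as a small decision function in Python. This is an assumption about how a tool might make the selection from the constraints given in the text (a signed 14-bit immediate with any register, or a signed 22-bit immediate with Gr0 through Gr3); the function names are hypothetical and a real assembler or compiler may apply additional rules.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def select_add_form(imm, r3):
    """Pick "adds" or "addl" for the pseudo-op add r1=imm,r3 (r3 is 0..127)."""
    if fits_signed(imm, 14):
        return "adds"                       # imm14 form works with any register
    if fits_signed(imm, 22) and 0 <= r3 <= 3:
        return "addl"                       # imm22 form needs Gr0 through Gr3
    raise ValueError("immediate does not fit either form for this register")


print(select_add_form(100, 10))
print(select_add_form(100000, 2))
```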
Subtract Instructions
Table 6.1 shows that the subtraction instructions are similar to those provided for addition. However, only three of the five formats are supported. The register form permits software to subtract the source operand value in the general register identified by r3 from the value of the source operand in the general register specified by r2. The difference that is produced is held in the general register specified by destination operand r1. Arithmetically, the operation is:
Gr1 = Gr2 – Gr3
The values in Gr1, Gr2, and Gr3 are signed integer numbers. An example of the minus 1 form is the instruction
(p0) sub r5=r10,r11,1
When executed, this instruction performs the subtract with borrow operation. This operation is expressed arithmetically as:
Gr5 = Gr10 – Gr11 – 1
Example
If the value in Gr5 is 000000000000000FH, what result is produced in Gr10 by executing the following instruction?
(p0) sub r10=0FAH,r5
Solution
The numerical values of the register source operand and sign-extended immediate operand are 000000000000000FH = +15 and FFFFFFFFFFFFFFFAH = –6, respectively. The subtraction is performed as
Gr10 = imm8 – Gr5 = –6 – (+15) = –21 = FFFFFFFFFFFFFFEBH
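The sign extension and subtraction in this example can be verified with a short Python sketch. The helper names are assumptions for illustration; the model covers only the arithmetic on 64-bit values.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def sub(r2, r3, minus_one=False):
    """Sketch of sub r1=r2,r3 and sub r1=r2,r3,1 on 64-bit registers."""
    return (r2 - r3 - (1 if minus_one else 0)) & MASK64


def sub_imm8(imm8, r3):
    """Sketch of sub r1=imm8,r3 with a sign-extended 8-bit immediate."""
    return (sign_extend(imm8, 8) - r3) & MASK64


gr10 = sub_imm8(0xFA, 0x000000000000000F)
print(hex(gr10), sign_extend(gr10, 64))
```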
Shift Left and Add Instruction
The last instruction in the arithmetic group is the shift left and add instruction (shladd). This instruction performs a special addition operation. As shown in Table 6.1, it has only one format. On the source side of the operand expression, the instruction has two source registers, r2 and r3, and a binary count, which is identified as count2. The value of the count is limited to the range of 1 through 4. When the instruction is executed, the processor shifts the bits of the value of the source operand specified by r2 left by the number of bit positions indicated by the count. As part of the shift operation, vacated least-significant bits are filled with 0s and bits shifted out of the most significant bit location are discarded. After the shift is complete, the shifted value is added to the value in source register r3 and the result is put into destination register r1. The source register r2 itself is not modified.
This addition operation with a shift of an operand is important for integer arithmetic. The operation is especially useful in algorithms where values are multiplied by powers of 2 and then added to another value.
Example
Write an instruction that uses registers Gr21 and Gr22 as the source operands, register Gr23 as the destination operand, and requires that the bits in Gr21 be shifted four bit positions left prior to the add taking place. Make execution of the instruction conditional, based on the state of predicate bit 12.
Solution
The shladd instruction performs this operation, as follows:
(p12) shladd r23=r21,4,r22
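The shift left and add operation reduces to a one-line expression, sketched here in Python. The function name and the sample values are assumptions; the usage lines illustrate the typical case of scaling an index by a power of 2 and adding a base address.

```python
MASK64 = (1 << 64) - 1


def shladd(r2, count, r3):
    """Sketch of shladd r1=r2,count2,r3 on 64-bit registers."""
    if not 1 <= count <= 4:
        raise ValueError("count2 must be in the range 1 through 4")
    shifted = (r2 << count) & MASK64    # bits shifted out of bit 63 are lost
    return (shifted + r3) & MASK64


# Address of element 3 in a table of 16-byte entries starting at 1000H.
gr23 = shladd(3, 4, 0x1000)
print(hex(gr23))
```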
LOGICAL INSTRUCTIONS

Logical instructions are another important subgroup of the integer computation instruction group. They perform logic operations, such as AND, OR, and exclusive OR, on data. The application programmer needs them to implement logical computations and mask operations.
AND, OR, and Exclusive OR Instructions

Table 6.2 provides a list of the instructions in the logical instruction subgroup and their formats. The mnemonics for the AND, OR, and exclusive OR instructions are and, or, and xor, respectively. Notice that each of these instructions supports the same two operand formats. The first format, the register form, performs the specified logic operation on two source operands, r2 and r3, that are in general registers. The result produced by this logic operation is held in another general register, which is identified as destination register r1. Again, execution of the instruction depends on the state of the qualifying predicate specified by qp. An example of this type of instruction is:

(p1) and r10=r20,r30

If the value in predicate register Pr1 is True, this instruction performs the logic operation:

Gr10 = Gr20 • Gr30

The value in general register 20 is bit-wise logically ANDed with the value in general register 30 and the result is placed in general register 10.

In the other format identified in Table 6.2, the first source operand is replaced with an 8-bit immediate operand. This immediate operand, which is marked imm8, is encoded into a field within the instruction. An example is the instruction:
(p0) xor r5=1FH,r10

Table 6.2: Logical Instructions

Mnemonic   Operation                Format
and        Logical AND              (qp) and r1=r2,r3
                                    (qp) and r1=imm8,r3
or         Logical OR               (qp) or r1=r2,r3
                                    (qp) or r1=imm8,r3
xor        Logical exclusive OR     (qp) xor r1=r2,r3
                                    (qp) xor r1=imm8,r3
andcm      Logical AND complement   (qp) andcm r1=r2,r3
                                    (qp) andcm r1=imm8,r3
When the xor instruction is executed, the value of the 8-bit operand, which is 1FH, is sign-extended to 64 bits (giving 000000000000001FH, because bit 7 of the operand is 0) and then exclusive-ORed with the value in Gr10. The result of this logical exclusive OR operation is saved in Gr5. Like the arithmetic instructions, the logic instructions have no completers.
AND Complement Instruction

The last instruction in the logical group is named AND complement and is identified by the mnemonic andcm. Notice in Table 6.2 that it supports the same two formats as the other three logic instructions. When executed, it forms the 1's complement of the second source operand (r3) and then combines that value with the first source operand (r2 or imm8) using a logical AND operation.

Example

Assume that the current contents of Gr10 are FFFFFFFFFFFFFF45H. What result is produced in the destination register when the following instruction is executed?

(p0) andcm r2=F0H,r10

Solution

First the 1's complement of the value in Gr10 is formed. This gives:

1's complement of FFFFFFFFFFFFFF45H = 00000000000000BAH

Now, ANDing with the other source operand, which is the sign-extended immediate value FFFFFFFFFFFFFFF0H, produces:

Gr2 = FFFFFFFFFFFFFFF0H • 00000000000000BAH = 00000000000000B0H
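The immediate forms of the logic instructions can be modeled in a few lines of Python. This is a sketch under stated assumptions: sext8, xor_imm8, and andcm_imm8 are hypothetical helper names, and the value used for Gr10 in the xor call is arbitrary because the text does not give one.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def sext8(imm8):
    """Sign-extend an 8-bit immediate to a 64-bit bit pattern."""
    imm8 &= 0xFF
    return (imm8 | (MASK64 ^ 0xFF)) if imm8 & 0x80 else imm8


def xor_imm8(imm8, gr3):
    """Hypothetical model of xor r1=imm8,r3."""
    return sext8(imm8) ^ (gr3 & MASK64)


def andcm_imm8(imm8, gr3):
    """Hypothetical model of andcm r1=imm8,r3: imm8 AND (NOT r3)."""
    return sext8(imm8) & ~gr3 & MASK64


print(f"{andcm_imm8(0xF0, 0xFFFFFFFFFFFFFF45):016X}")  # 00000000000000B0
print(f"{xor_imm8(0x1F, 0x00000000000000FF):016X}")    # 00000000000000E0
```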
LARGE CONSTANT GENERATING INSTRUCTIONS

The instruction set defines special instructions for generating large constants, which means creating a constant by loading a value into a general register. The two instructions available for this purpose are shown in Table 6.3.

Table 6.3: Large Constant Generation Instructions

Mnemonic   Operation             Format
movl       Move long immediate   (qp) movl r1=imm64
mov        Move immediate        (qp) mov r1=imm22
Notice that the first instruction, move long immediate, copies a 64-bit immediate source operand, which is identified as imm64, into the destination general register. An example is the following instruction:

(p3) movl r10=0FFFFFFFF00000000H

If the value in predicate register 3 is True when this instruction is executed, general-purpose register 10 is initialized with immediate value FFFFFFFF00000000H. Because of the large immediate operand, this instruction occupies two slots within the same bundle and is the only such instruction.

The other instruction, move immediate, copies constants up to 22 bits in size (sign bit plus 21-bit value) into any of the general-purpose registers. The 22-bit constant is sign-extended to 64 bits before it is placed in the register. This instruction is actually a pseudo-operation for an add instruction. For instance, if the compiler identifies the instruction

(p0) mov r10=30F0F0H

it replaces the instruction with the add operation

(p0) addl r10=30F0F0H,r0

Remember that Gr0 contains a hardwired value of 0. Therefore, adding the content of Gr0 to the immediate operand simply loads the value of the immediate operand into Gr10.
32-BIT INTEGER AND 32-BIT ADDRESS POINTER INSTRUCTIONS

Special instructions are also provided in the instruction set to handle addresses and data that are less than 64 bits in length. These instructions take an 8-bit, 16-bit, or 32-bit value in a register and produce a properly extended 64-bit result. For instance, in the case of a 32-bit signed integer, the sign bit has to be extended into the upper 32 bit locations. For a 32-bit unsigned integer, these bit positions are filled with 0s.
Sign-Extend and Zero-Extend Instructions

The instructions that are used to do sign fill and zero fill operations on integer data are called sign-extend (sxt) and zero-extend (zxt), respectively. The formats of the sxt and zxt instructions are given in Table 6.4.

Table 6.4: 32-bit Address Pointer and 32-bit Integer Instructions

Mnemonic   Operation                           Format
sxt        Sign-extend                         (qp) sxtxsz r1=r3
zxt        Zero-extend                         (qp) zxtxsz r1=r3
addp       32-bit pointer addition             (qp) addp4 r1=r2,r3
                                               (qp) addp4 r1=imm14,r3
shladdp    Shift left and add 32-bit pointer   (qp) shladdp4 r1=r2,count2,r3
Notice that the sign-extend instruction is given in general as:

(qp) sxtxsz r1=r3

The mnemonic has been extended with xsz, which stands for extend size. It is replaced by one of three values (1, 2, or 4) that tells how many bytes the original piece of data contains. Table 6.5 shows the possible values for xsz and the position of the most-significant bit of the original data for each.

Table 6.5: Available Values for the xsz Part of Mnemonic

xsz mnemonic   Bit position
1              7
2              15
4              31
If the instruction is to extend a signed byte to a signed quad word, the xsz part of the mnemonic is replaced by 1. All bit positions more significant than bit 7 are filled with the value of the sign bit (bit 7). On the other hand, if a signed double word is being extended to a signed quad word, xsz takes on the value 4 and all bits higher than bit 31 are filled with the value of the sign bit (bit 31). For example, assume that Gr6 contains the word F550H (1111010101010000 in binary), so that the sign bit (b15) is 1, when the following instruction is executed:

(p0) sxt2 r5=r6

Executing this instruction produces the quad word FFFFFFFFFFFFF550H in Gr5.

As we indicated earlier, the zxt instruction is used to extend unsigned integer data bytes, words, and double words to 64 bits. It works similarly to sxt, except that the bits above the specified bit location are filled with 0s.
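The two extension operations differ only in the fill value, which a short Python sketch makes visible. The function names sxt and zxt mirror the mnemonics, but the signatures (with xsz passed as an argument) are an assumption of this illustration.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def zxt(xsz, gr3):
    """Hypothetical model of zxt<xsz> r1=r3: keep the low xsz bytes, zero the rest."""
    if xsz not in (1, 2, 4):
        raise ValueError("xsz must be 1, 2, or 4")
    return gr3 & ((1 << (8 * xsz)) - 1)


def sxt(xsz, gr3):
    """Hypothetical model of sxt<xsz> r1=r3: copy the sign bit into all upper bits."""
    low = zxt(xsz, gr3)
    bits = 8 * xsz
    if low >> (bits - 1):                  # sign bit is bit 7, 15, or 31
        low |= MASK64 ^ ((1 << bits) - 1)  # fill every higher bit with 1
    return low


print(f"{sxt(2, 0xF550):016X}")  # FFFFFFFFFFFFF550
print(f"{zxt(2, 0xF550):016X}")  # 000000000000F550
```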
Add Pointer Instruction

The Itanium architecture provides support for extending 32-bit address pointers to 64 bits. For instance, the add pointer (addp4) instruction adds a 32-bit pointer that is in a register to either a 32-bit offset in another register or a 14-bit immediate operand. These instruction formats are shown in Table 6.4.

The operation of the register form of the addp4 instruction is illustrated in Figure 6.1. The first step in the extension process is to clear the 32 most-significant bits of destination register r1. Then the values of bits 30 and 31 of the second source operand (r3) are copied into bits 61 and 62, respectively, of the destination operand. Finally, the value of the 32-bit pointer in r3 is added to the 32-bit offset in r2 and the result is placed in the 32 least-significant bit locations of r1.
Figure 6.1: Add Pointer Operation

The same operation is performed for the immediate operand form of the addp4 instruction, except that the immediate operand is first sign-extended to 32 bits. The ability to extend pointers is required for the virtual address translation model. For example, after the 32-bit add and forcing the upper 32 bits to zero, the top 2 bits of r3 are copied into bits 62:61 of destination register r1. As a result of forcing bit 63 to zero and copying bit values into bits 62 and 61 of the destination register, the pointer is in the lower 4-gigabyte region and has the same virtual region number as pointer operand r3.
Shift Left and Add Pointer Instruction

Another instruction that is used for extending address pointers to 64 bits is shift left and add pointer (shladdp4). Looking at the format of this instruction in Table 6.4, you can see that it only processes operands that are in registers and that a count is included in the source operand field. Its operation is displayed in Figure 6.2. Notice that this instruction performs a 32-bit pointer addition similar to that provided by the register form of the addp4 instruction. However, this time the operation shifts the value in source operand register r2 left by the number of bits specified by count2 before the add takes place. The least-significant bits that are vacated during the shift operation are filled with zeros. The only values allowed for the count are 1, 2, 3, or 4. Again, bits 30 and 31 of source operand r3 are copied to bit locations 61 and 62 of destination register r1.
Figure 6.2: Shift Left and Add Pointer Operation

Example

What result is produced when the following instruction is executed?

(p0) shladdp4 r10=r5,4,r6

Assume that the values of the address pointers in Gr5 and Gr6 are 12345678H and 55555555H, respectively.

Solution

First the value in Gr5 is shifted left by four bit positions to give this result:

Gr5 shifted = 123456780H

Next, we determine the values of bits 30 and 31 of the value in Gr6. Expressing the value in binary form, we get:

Gr6 = 01010101010101010101010101010101 (binary)

Therefore, b31b30 = 01 (binary) = 1H and the most-significant four bits of the 64-bit register Gr10 are

b63b62b61b60 = 0010 (binary) = 2H

Now the low 32 bits of the shifted value of Gr5 are added to the 32-bit value of Gr6:

23456780H + 55555555H = 789ABCD5H

Assembling these results in Gr10 gives

Gr10 = 20000000789ABCD5H
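The steps of this solution can be retraced with a small Python model. It is a sketch only: the function names addp4 and shladdp4 follow the mnemonics, while the argument order and the decision to build shladdp4 on top of addp4 are assumptions of the illustration.

```python
MASK32 = (1 << 32) - 1


def addp4(gr2, gr3):
    """Hypothetical model of addp4 r1=r2,r3."""
    low = (gr2 + gr3) & MASK32  # 32-bit add; bits 63:32 start out cleared
    region = (gr3 >> 30) & 0x3  # bits 31:30 of the pointer in r3
    return low | (region << 61) # copied into bits 62:61; bit 63 stays 0


def shladdp4(gr2, count2, gr3):
    """Hypothetical model of shladdp4 r1=r2,count2,r3."""
    if not 1 <= count2 <= 4:
        raise ValueError("count2 must be 1, 2, 3, or 4")
    return addp4(gr2 << count2, gr3)


print(f"{shladdp4(0x12345678, 4, 0x55555555):016X}")  # 20000000789ABCD5
```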
SHIFT AND BIT-FIELD INSTRUCTIONS

Three classes of shift operations are defined for shifting and operating on bit fields within a general register: variable shifts, 128-bit-input funnel shift, and fixed shift-and-mask operations. The instructions that perform these operations make up the shift and bit-field instruction subgroup, which includes the shift left, shift right, shift right pair, extract, and deposit instructions. The mnemonics and formats for each of these instruction types are shown in Table 6.6.

Table 6.6: Shift and Bit Field Instructions

Mnemonic   Operation                              Format
shl        Shift left                             (qp) shl r1=r2,r3
                                                  (qp) shl r1=r2,count6
shr        Shift right                            (qp) shr r1=r3,r2
                                                  (qp) shr.u r1=r3,r2
                                                  (qp) shr r1=r3,count6
                                                  (qp) shr.u r1=r3,count6
shrp       Shift right pair                       (qp) shrp r1=r2,r3,count6
extr       Extract (shift right and mask)         (qp) extr r1=r3,pos6,len6
                                                  (qp) extr.u r1=r3,pos6,len6
dep        Deposit (shift left, mask and merge)   (qp) dep r1=r2,r3,pos6,len4
                                                  (qp) dep r1=imm1,r3,pos6,len6
                                                  (qp) dep.z r1=r2,pos6,len6
                                                  (qp) dep.z r1=imm8,pos6,len6
Shift Left and Shift Right Instructions

The shift left instruction is used to shift the bits of a source operand in a general register left by a specified number of bit positions. The count representing the number of bit positions to be shifted can be specified in a second general register or as an immediate operand. Each time bits are shifted to the left, the vacated least-significant bit location is filled with a 0.

Looking at the format of shl in Table 6.6, you can see that, for the register form of the instruction, the source operand whose bits are to be shifted is identified as r2 and the count is identified as r3. Even though a 64-bit register is supplied to hold the count, the count is interpreted as an unsigned number and only the values 0 through 63 (2^6 – 1), which fit in the six least-significant bits of this register, leave any of the original bits in the result. The result that is produced by the shift operation is placed in destination register r1. If the count in r3 exceeds 63, the result that is returned to the destination operand r1 is all 0s.

The immediate operand form of the instruction produces the same result as the register form. It only differs in the way the count is specified. Notice that the immediate operand is labeled count6 and is also limited to six bits. An example is the instruction:

(p0) shl r10=r5,8

The shift right instruction shifts the bits of a source operand in a general register in the opposite direction, that is, from the most-significant bit end toward the least-significant bit end. As shown in Table 6.6, this instruction is available in both a signed and unsigned form. Notice that the completer .u is appended to the mnemonic shr to distinguish the unsigned shift right (shr.u) instruction from the signed shift right (shr) instruction.

The shift right operation works differently for the signed and unsigned instructions. In both cases, bits are shifted from the most-significant bit end toward the least-significant bit position, as described earlier, but the fill value entered in the vacated most-significant bit position changes. For a signed shift right, the most-significant bit is filled with the value of the original sign bit, bit 63. On the other hand, the vacated most-significant bit is filled with 0 during unsigned shift right operations.

The formats allowed for the signed and unsigned shift right instructions are shown in Table 6.6. They are the same as the register form and immediate form used by the shift left instruction. As before, the count is specified by either the second source operand (r2) or the immediate operand (count6). However, the operations of the instructions differ if a count greater than 63 is specified in the register form. If this count is specified in a shr.u instruction, the result produced in r1 is all 0s. But, for a signed shift right, all bit positions in r1 are filled with the value of the original sign bit of the operand in r3.
Example

Gr10 holds the value FEDCBA9876543210H. What result is produced in destination register Gr5 by executing the following pair of instructions?

(p0) shr r5=r10,8 ;;
(p0) shl r5=r10,4

Solution

The first instruction performs an 8-bit signed shift right of the value in Gr10. Since b63 in Gr10, which is the sign bit, is 1, the vacated most-significant bits are filled with 1s. This gives the result

Gr5 = FFFEDCBA98765432H

Executing the shift left instruction causes the value in Gr10 to be shifted left by four bit positions and the vacated bits are filled with 0s. Because this instruction reads its source from Gr10, not from the result of the first instruction, it overwrites Gr5 with

Gr5 = EDCBA98765432100H
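The different fill rules, and the behavior for counts above 63, can be modeled as follows. This Python sketch is illustrative only; the function names shl, shr, and shr_u follow the mnemonics, and each takes the value to shift first and the count second as an assumption of the example.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def shl(value, count):
    """Hypothetical model of shl: zero fill on the right, all 0s if count > 63."""
    return (value << count) & MASK64 if count <= 63 else 0


def shr_u(value, count):
    """Hypothetical model of shr.u: zero fill on the left, all 0s if count > 63."""
    return (value & MASK64) >> count if count <= 63 else 0


def shr(value, count):
    """Hypothetical model of shr: sign fill on the left, all sign bits if count > 63."""
    signed = value - (1 << 64) if value >> 63 else value
    return (signed >> min(count, 63)) & MASK64


gr10 = 0xFEDCBA9876543210
print(f"{shr(gr10, 8):016X}")    # FFFEDCBA98765432
print(f"{shr_u(gr10, 8):016X}")  # 00FEDCBA98765432
print(f"{shr(gr10, 64):016X}")   # FFFFFFFFFFFFFFFF
print(f"{shl(gr10, 64):016X}")   # 0000000000000000
```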
Shift Right Pair Instruction

The shift right pair (shrp) instruction performs a 128-bit-input funnel shift. The diagram in Figure 6.3 demonstrates this operation. Notice that the instruction concatenates two source operand registers, r2 and r3, to form a 128-bit value. An immediate count specifies the number of bit positions that this value is shifted right through the two registers. After the shift is completed, the 64 least-significant bits of the concatenated registers represent the result. This value is placed in destination register r1.

Application programmers can use the funnel shift operation to simplify the realignment of unaligned data. For example, if you have a large data structure (> 64 bits) and you want to extract a field of bits that straddles a 64-bit boundary, this instruction would be useful.
Figure 6.3: 128-bit-input Funnel Shift Operation

As indicated by the format of the instruction in Table 6.6, the value of the count is again six bits in length and limits the shift operation to a range from 0 to 63 bit positions.

Example

If the contents of Gr5 are 0123456789ABCDEFH, what results are produced by executing this pair of instructions?

(p0) shrp r10=r5,r5,4 ;;
(p0) shrp r10=r5,r5,8

What operation is this form of the instruction performing?

Solution

Using the same register for both of the source operands creates the 128-bit value:

0123456789ABCDEF0123456789ABCDEFH

Executing the first instruction causes the bits in this 128-bit value to shift right by four bit positions, then the 64 least-significant bits are saved in Gr10 as the result:

Gr10 = F0123456789ABCDEH

The second instruction performs the same operation on the unchanged value in Gr5, except that the shift right is eight bit positions, resulting in:

Gr10 = EF0123456789ABCDH

Looking at these two results, you can see that a rotate operation has been implemented. That is, in the result, bits shifted out at the least-significant end have been rotated back in at the most-significant bit end.
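The funnel shift is easy to model with wide integers. In the following Python sketch the function name shrp follows the mnemonic; the assumption that r2 supplies the upper 64 bits of the 128-bit value is stated in a comment, and the two-word values in the second call are invented purely to illustrate a field that straddles a 64-bit boundary.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def shrp(gr2, gr3, count6):
    """Hypothetical model of shrp r1=r2,r3,count6."""
    if not 0 <= count6 <= 63:
        raise ValueError("count6 must be in the range 0 to 63")
    combined = ((gr2 & MASK64) << 64) | (gr3 & MASK64)  # assumed: r2 is the upper half
    return (combined >> count6) & MASK64


gr5 = 0x0123456789ABCDEF
# Same register for both sources: the funnel shift acts as a rotate right.
print(f"{shrp(gr5, gr5, 4):016X}")  # F0123456789ABCDE

# Realigning data that straddles two 64-bit words (made-up values):
high, low = 0x00000000000000AB, 0xCD00000000000000
print(f"{shrp(high, low, 56):016X}")  # 000000000000ABCD
```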
Extract and Deposit Instructions

The fixed shift-and-mask operations are performed by the bit field instructions extract (extr) and deposit (dep). The extr instruction copies a selected field of bits from a source operand in a general register to the least-significant bit locations of the destination register. Figure 6.4 illustrates this type of operation. In this example, bits 7 through 56 in source operand register r3 are selected and copied to bit locations 0 through 49 of destination register r1, masking off the other bits of the original value and shifting the field of bits that remains to a fixed position. The unused bits of the destination register are filled with 0s, if an unsigned extract operation is performed, or with the value of the sign bit, if a signed extract is taking place. The sign is taken from the most-significant bit of the extracted field. If the specified field extends beyond the most-significant bit of r3, the sign is taken from the most-significant bit of r3.

Now look at the format of the unsigned and signed extract instructions in Table 6.6. Adding the completer .u to the mnemonic extr creates the unsigned instruction. Moreover, the operands used in both instructions are the same. The general register that holds the value from which bits are to be extracted is specified as r3 and two immediate operands, pos6 and len6, define the field of bits to be extracted. Here pos6 identifies the bit position where the field begins and len6 tells how many bits are in the field. For instance, in the example of Figure 6.4, pos6 equals 7 and len6 equals 50. The immediate value len6 can be any number in the range 1 to 64, and this value is encoded as length – 1 in the instruction. The immediate value pos6 can be any value in the range 0 to 63.
Figure 6.4: Extract Operation

Example

If the value in Gr5 is FFFFAAAA99990000H, what result is produced in Gr10 by executing the following instruction?

(p0) extr.u r10=r5,32,8

Solution

Execution of this instruction extracts the byte in bit positions 32 through 39 of the value in Gr5. The value of this byte is AAH. Placing this value in bit positions 0 through 7 of Gr10 and zero-extending to 64 bits produces the result

Gr10 = 00000000000000AAH
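The shift-and-mask nature of the extract operation, and the difference between its unsigned and signed forms, can be sketched in Python. The function names extr_u and extr are hypothetical stand-ins for the two mnemonics; truncating the field at bit 63 follows the description of fields that extend beyond the most-significant bit of r3.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def extr_u(gr3, pos6, len6):
    """Hypothetical model of extr.u r1=r3,pos6,len6: shift right, mask, zero fill."""
    length = min(len6, 64 - pos6)  # a field running past bit 63 is cut off there
    return ((gr3 & MASK64) >> pos6) & ((1 << length) - 1)


def extr(gr3, pos6, len6):
    """Hypothetical model of extr r1=r3,pos6,len6: as above, but sign fill."""
    length = min(len6, 64 - pos6)
    field = extr_u(gr3, pos6, len6)
    if field >> (length - 1):  # most-significant bit of the field
        field |= MASK64 ^ ((1 << length) - 1)
    return field


gr5 = 0xFFFFAAAA99990000
print(f"{extr_u(gr5, 32, 8):016X}")  # 00000000000000AA
print(f"{extr(gr5, 32, 8):016X}")    # FFFFFFFFFFFFFFAA
```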
The deposit instruction performs the reverse of the operation performed by the extract instruction. A specified number of lower-order bits from one source operand are merged into an equal-size field of bit positions within the value in the second source operand to produce the result in the destination operand. Figure 6.5 demonstrates the operation of the merge register form of the deposit instruction. The format of this instruction is shown in Table 6.6 as:

(qp) dep r1=r2,r3,pos6,len4

Looking at Figure 6.5, you can see that the field to be deposited resides in source register r2. The length of this field is defined by the immediate value len4 and it is right-justified in r2. Since len4 has only four bits, the size limit for the deposited field is 1 to 16 bits, and it is encoded as length – 1 in the instruction. This field of bits is merged with the value in source operand r3 by depositing it at a position defined by immediate value pos6. The pos6 immediate value has a range of 0 to 63. The result produced by merging bits from r2 into the value in r3 is placed in destination register r1. In the event that the deposited bit field extends beyond bit 63 of the destination register, those bits that would extend beyond the end are truncated.
Figure 6.5: Deposit Operation

As an example, here is the instruction that will produce the results shown in Figure 6.5. Sixteen bits are deposited from r2, so the value of len4 is 10H. The bits are merged into the value in r3, starting at bit position 36, so pos6 equals 24H. Assuming that the instruction's execution is not qualified by a predicate, it is written as:

(p0) dep r1=r2,r3,24H,10H

Three other forms of the deposit operation are shown in Table 6.6. The operations they perform are slightly different from those just described. For instance, the merge form immediate version of the instruction is written as:

(qp) dep r1=imm1,r3,pos6,len6

Execution of this instruction produces a result similar to that shown in Figure 6.5, except that the field is deposited from the immediate operand imm1. Since the immediate operand is one bit wide, the only values are 0 and 1. The value specified as imm1 is sign-extended to the length specified by len6 and then deposited into the value from r3, starting at the position picked with pos6. Notice that the length operand, len6, is now six bits wide and can specify a length from 1 to 64 bits.

The last two forms of the instruction have the completer .z, which stands for zero. For this reason, they are known as the zero form register and zero form immediate formats. The dep.z instructions perform the same function as their nonzero counterparts, but they deposit the field of bits directly into the destination register starting with the bit position identified by pos6, then clear all other bits of the destination operand. Notice that the immediate operand (imm8) is now eight bits wide.

Example

What dep.z instruction would extract a 4-bit field from an immediate operand that has a value equal to FFH and deposit it into destination register Gr10 starting at bit position 8? What result is produced in the register when the instruction is run? Assume that the execution of the instruction is not conditional on the value of a predicate.

Solution

The values of imm8, len6, and pos6 are FFH, 4H, and 8H, respectively. Therefore, the instruction is

(p0) dep.z r10=FFH,8H,4H

When executed, bit positions 8, 9, 10, and 11 are made 1 and all others are cleared to 0. This gives

Gr10 = 0000000000000F00H
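The merge and zero forms of the deposit operation differ only in the background value the field is placed into, as the following Python sketch suggests. The names dep and dep_z follow the mnemonics, but the signatures are hypothetical, and the source values in the second call are invented for illustration.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def dep(src, gr3, pos6, length):
    """Hypothetical model of the merge form: put the low `length` bits of src
    into gr3 starting at bit pos6; bits that would pass bit 63 are truncated."""
    mask = (((1 << length) - 1) << pos6) & MASK64
    return (gr3 & ~mask & MASK64) | ((src << pos6) & mask)


def dep_z(src, pos6, length):
    """Hypothetical model of the zero form: deposit into an all-zero value."""
    return dep(src, 0, pos6, length)


print(f"{dep_z(0xFF, 8, 4):016X}")                       # 0000000000000F00
print(f"{dep(0xABCD, 0xFFFFFFFFFFFFFFFF, 36, 16):016X}")  # FFFABCDFFFFFFFFF
```

Modeling dep.z as a deposit into zero keeps the two forms visibly related; it is a design choice of the sketch, not a statement about how the hardware is organized.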
CHARACTER STRING AND POPULATION COUNT INSTRUCTION GROUPING

Another small set of special instructions, called the character string and population count instructions, perform bit-wise processing of data in a general register. They can be used to scan the bits of data in a source operand to find out whether or not an 8-bit or 16-bit element consisting of all 0s exists, or to count the number of bits in the value that are logic 1, respectively. The application programmer can use these instructions to simplify the implementation of operations on character and bit-field data.
Character String Instruction

In Table 6.7, the first base character string instruction is the compute zero index (czx). The czx instruction scans the value in the general register (r3) looking for a zero element. The element containing all 0s can be either an aligned byte (8 bits wide) or an aligned pair of bytes (16 bits wide). That is, the czx instruction treats the general register source as either eight 1-byte or four 2-byte elements. For this reason, you see both a czx1 (one-byte form) and a czx2 (two-byte form) of the instruction in Table 6.7. Moreover, either of these searches can be started from the least-significant bit (right form) end of the register or the most-significant bit (left form) end. The scan direction is specified by adding a completer (.r for right form or .l for left form) to the mnemonic.

Table 6.7: Character String and Population Count Instructions

Mnemonic   Operation            Format
czx        Compute zero index   (qp) czx1.l r1=r3
                                (qp) czx1.r r1=r3
                                (qp) czx2.l r1=r3
                                (qp) czx2.r r1=r3
popcnt     Population count     (qp) popcnt r1=r3
Look next at the results produced when a czx instruction is executed. If a valid zero element is found, an index that identifies the byte or word that contains the element is placed in the destination register. Table 6.8 summarizes the possible results that can be produced by the czx instruction.

Table 6.8: Result Ranges for czx

Size   Element width   Range of result if zero element found   Default result if no zero element found
1      8 bit           0–7                                     8
2      16 bit          0–3                                     4
For the one-byte form (size = 1 and element width = 8-bit), the aligned byte locations in the source operand, starting from the least-significant byte, are identified with the index 0H through 7H. The value in r3 could contain more than one aligned zero element. In this case, the index of the first zero element that is encountered in the scan is put into the destination register. For this reason, the left scan and the right scan of the same piece of data could produce different results. If no aligned zero byte-wide element is found, the default value placed in r1 is 8H. An example is the instruction:

(p0) czx2.r r10=r5

This is the two-byte form, right form of the instruction. It specifies a scan that starts from the least-significant bit of the value in source register Gr5 looking for an aligned 16-bit zero element. If a valid zero element is located, the value 0H, 1H, 2H, or 3H corresponding to that byte-pair is placed in Gr10. If none is found, the value in Gr10 becomes 4H.
Population Count Instruction

The other instruction in Table 6.7 is the population count (popcnt) instruction. This instruction also tests the bits of the source operand general register (r3). However, in this case it counts the number of bits in the source register that are logic 1 and writes the sum into the destination general register (r1).

Example

If the contents of Gr5 are 0120006689AA00BFH, what results do the following instructions produce?

(p0) czx1.r r10=r5 ;;
(p0) czx1.l r10=r5 ;;
(p0) popcnt r10=r5

Solution

The value of the source operand is written in binary form as:

Gr5 = 0000000100100000000000000110011010001001101010100000000010111111 (binary)

Since the czx instructions are one-byte form, the bits are broken into aligned bytes as:

Gr5 = 00000001 00100000 00000000 01100110 10001001 10101010 00000000 10111111 (binary)

When executed, the first instruction scans the value in Gr5 from the right to identify the first aligned byte-wide zero element. Starting from the least-significant bit end, we find that the second aligned byte element is all 0s; therefore, the index placed in Gr10 is 1H. The second instruction performs the same scan operation on Gr5 except it is a left scan and starts from the most-significant bit position. In this case, the first aligned byte-wide zero element is in position 6. Therefore, the index value 5H results in Gr10. Finally, the popcnt instruction counts the number of bits in Gr5 that are logic 1, making Gr10 equal 20 (14H).
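The right-form scan and the bit count in this example can be reproduced with a short Python model. It is a sketch that covers only the right forms (czx1.r and czx2.r) and popcnt; the function names czx_r and popcnt and the way the element size is passed as an argument are assumptions of the illustration.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def czx_r(size, gr3):
    """Hypothetical model of czx<size>.r: index of the first all-zero element,
    scanning from the least-significant end; 8 (or 4) if none is found."""
    if size not in (1, 2):
        raise ValueError("size must be 1 or 2")
    width = 8 * size      # element width in bits
    count = 8 // size     # eight bytes or four byte-pairs
    for index in range(count):
        element = (gr3 >> (index * width)) & ((1 << width) - 1)
        if element == 0:
            return index
    return count


def popcnt(gr3):
    """Hypothetical model of popcnt: number of bits that are logic 1."""
    return bin(gr3 & MASK64).count("1")


gr5 = 0x0120006689AA00BF
print(czx_r(1, gr5))  # 1
print(czx_r(2, gr5))  # 4 (no aligned 16-bit element is zero)
print(popcnt(gr5))    # 20
```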
Chapter 7 - Compare and Branch Instructions Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Chapter 7: Compare and Branch Instructions

Chapter 6 introduced the integer computation instructions of the Itanium processor's instruction set. In this chapter, we continue our study of the ISA with the compare and branch instruction groups. A number of software structures and architectural capabilities are also introduced in conjunction with these instructions. They include: predication; elimination of branches with predication; register stack and the function call process; counted, while, and pipelined loops; and rotating registers.
QUALIFYING INSTRUCTION EXECUTION WITH A PREDICATE

In Chapter 1, we introduced predication as the method by which the execution of an instruction is made conditional. The chapters that followed showed that predicates are one-bit values and are stored in the predicate register file. Also, the execution of most instructions can be gated by a qualifying predicate. A 0 predicate is interpreted as False and a 1 predicate is interpreted as True. If the predicate is True, the instruction executes normally. If the predicate is False, the instruction does not modify the application state.

We have found that the predicate register used to determine whether or not an instruction is executed is specified in the instruction format by an optional qualifying predicate. Some examples of predicated instructions are:

(p1) add r1=r2,r3
(p2) ld8 r5=[r7]
(p3) chk.s r4,recovery

For instance, if the state of Pr1 is True when the first instruction is executed, the add operation is performed normally and the state of register Gr1 is updated with the result produced by adding the values in Gr2 and Gr3. On the other hand, if the value of Pr1 is False, the instruction behaves like a nop and the value in Gr1 is unchanged.

Remember that the execution of an instruction can be made unconditional by specifying Pr0 as the qualifying predicate because the value in Pr0 is hardwired to 1. Nonpredicated instructions in the output of the compiler are expressed without a predicate register. This notation implies that they are qualified by Pr0. Earlier we also indicated that a few instructions do not support a qualifying predicate. These exceptions are the instructions allocate stack frame (alloc), clear rrb (clrrrb), flush register stack (flushrs), and counted branches (cloop, ctop, and cexit).
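The gating effect of a qualifying predicate can be pictured with a toy register model. The Python sketch below is an illustration only: make_state and predicated_add are hypothetical names, and the model covers nothing beyond the add example discussed above.

```python
MASK64 = (1 << 64) - 1  # general registers are 64 bits wide


def make_state():
    """Toy register files: 128 general registers and 64 predicate registers."""
    gr = [0] * 128
    pr = [False] * 64
    pr[0] = True  # Pr0 is hardwired to 1 (True)
    return gr, pr


def predicated_add(gr, pr, qp, r1, r2, r3):
    """Hypothetical model of (qp) add r1=r2,r3."""
    if pr[qp]:  # a False predicate makes the instruction behave like a nop
        gr[r1] = (gr[r2] + gr[r3]) & MASK64


gr, pr = make_state()
gr[2], gr[3] = 5, 7
predicated_add(gr, pr, 1, 1, 2, 3)  # (p1) add r1=r2,r3 with Pr1 False
print(gr[1])                        # 0, Gr1 is unchanged
pr[1] = True
predicated_add(gr, pr, 1, 1, 2, 3)  # same instruction with Pr1 True
print(gr[1])                        # 12
```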
COMPARE INSTRUCTION GROUP

The instructions in the compare instruction group are used to modify the states of predicate registers. They can test source operands for a variety of conditions and, based on the results, set or reset destination predicate registers. The values in the predicate registers can then be used to affect program execution in two ways: as conditions for conditional branches or as qualifying predicates for instructions.

The instructions in the compare instruction group are shown in Table 7.1. They include the compare, test bit, and test NaT bit instructions. Next we will examine the operation of each of these instructions. There are also floating-point compare instructions, but they will be examined in a later chapter on the Itanium processor's floating-point architecture.

Table 7.1: Compare Instruction Formats

Mnemonic   Operation      Format
cmp        Compare        (qp) cmp.crel.ctype p1, p2=r2, r3
                          (qp) cmp.crel.ctype p1, p2=imm8, r3
                          (qp) cmp.crel.ctype p1, p2=r0, r3
                          (qp) cmp.crel.ctype p1, p2=r3, r0
cmp4       Compare word   (qp) cmp4.crel.ctype p1, p2=r2, r3
                          (qp) cmp4.crel.ctype p1, p2=imm8, r3
                          (qp) cmp4.crel.ctype p1, p2=r0, r3
                          (qp) cmp4.crel.ctype p1, p2=r3, r0
tbit       Test bit       (qp) tbit.trel.ctype p1, p2=r3, pos6
tnat       Test NaT bit   (qp) tnat.trel.ctype p1, p2=r3
GENERAL REGISTER COMPARE INSTRUCTIONS
There are two general register compare instructions, compare (cmp) and compare word (cmp4), in the instruction set. The first general format of the compare instruction in Table 7.1 is

(qp) cmp.crel.ctype p1,p2=r2,r3

As this instruction executes, the value of the source operand in general register r3 is compared to that in r2. Based on the outcome of this comparison, a Boolean result is produced in the two destination predicate registers, which are denoted in general as p1 and p2. The mnemonic for the instruction can be appended with two completers to further define the compare operation that it is to perform. These completers are compare relationship (.crel) and compare type (.ctype). The second format of the compare instruction shown in Table 7.1 simply replaces the source operand register, which is identified in general as r2, with an immediate operand, imm8.
Compare Relationship Completers
Let us first look at the compare relationship completer and how it is used to specify the conditional test that is to be made on the source operands in the general registers. Using the crel completer, one of 10 relationships can be specified for the comparison operation performed by the compare instruction. Table 7.2 lists these relationships. Notice that they include comparisons to determine whether or not the two values are equal, or if one is greater than or less than the other. Under the compare relation column in this table, a stands for the content of general register r2 and b represents the content of general register r3. For example, if the completer .ne is added to the compare instruction, the values in r2 (a) and r3 (b) are tested to determine if they are not equal. The inequality relationships (greater than, less than, greater than or equal to, and less than or equal to) can be implemented to test the values in r2 and r3 as either signed or
unsigned numbers. For instance, the completer .lt tests to determine if the signed value in r2 (a) is less than that in r3 (b). The unsigned form of this instruction is written using the completer .ltu.

Table 7.2: Compare Relationship Completers for Normal and Unconditional Instruction Types

crel  Meaning                            Compare relation  Immediate range
eq    Equal                              a == b            –128 .. 127
ne    Not equal                          a != b            –128 .. 127
lt    Less than signed                   a < b             –128 .. 127
le    Less than or equal to signed       a <= b            –127 .. 128
gt    Greater than signed                a > b             –127 .. 128
ge    Greater than or equal to signed    a >= b            –128 .. 127
ltu   Less than unsigned                 a < b             0 .. 127
leu   Less than or equal to unsigned     a <= b            1 .. 128
gtu   Greater than unsigned              a > b             1 .. 128
geu   Greater than or equal to unsigned  a >= b            0 .. 127
Table 7.2 also shows the range of immediate operand values allowed for each of the completer relationships. Since the immediate operand is limited to eight bits, the range of signed numbers that it can represent is from –128 through +127.
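The difference between the signed and unsigned relationships is easiest to see on a bit pattern whose most-significant bit is set. The sketch below models the .lt and .ltu tests in Python; the helper names to_signed64, cmp_lt, and cmp_ltu are illustrative assumptions, not names taken from the instruction set.

```python
MASK64 = (1 << 64) - 1

def to_signed64(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value

def cmp_lt(a, b):
    """Signed less-than, as selected by the .lt completer."""
    return to_signed64(a) < to_signed64(b)

def cmp_ltu(a, b):
    """Unsigned less-than, as selected by the .ltu completer."""
    return (a & MASK64) < (b & MASK64)

a = 0xFFFFFFFFFFFFFFFF   # -1 when read as signed, 2**64 - 1 when read as unsigned
b = 0x0000000000000001

signed_result = cmp_lt(a, b)      # -1 < 1 holds
unsigned_result = cmp_ltu(a, b)   # 2**64 - 1 < 1 does not hold
```

The same pair of register values therefore satisfies .lt but not .ltu, which is why the choice of completer matters whenever an operand may have its top bit set.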
Comparison Type Completer
Earlier we pointed out that when the compare instruction is executed it produces Boolean results in the destination predicate registers, p1 and p2. Based on the comparison type completer employed in the instruction, the predicate registers can be updated in a number of different ways. A list of the comparison types supported by the general register compare instruction, and information about how they affect the predicate registers is given in Table 7.3. Notice that the Boolean update produced in the predicate registers when a compare instruction is executed does not depend only on the result of the comparison operation. It is also affected by the state of the qualifying predicate of the instruction and the NaT bits of the source operands. Let us next look at the impact of selecting the normal and unconditional (.unc) completer for a compare instruction.

Table 7.3: Comparison Type Completers

                            Pr[qp]==0       Pr[qp]==1
                                            Result==0,      Result==1,      One or more
                                            no source NaT   no source NaT   source NaTs
ctype     Meaning           Pr1  Pr2        Pr1  Pr2        Pr1  Pr2        Pr1  Pr2
None      Normal            -    -          0    1          1    0          0    0
unc       Unconditional     0    0          0    1          1    0          0    0
or        OR                -    -          -    -          1    1          -    -
orcm      OR complement     -    -          1    1          -    -          -    -
and       AND               -    -          0    0          -    -          0    0
andcm     AND complement    -    -          -    -          0    0          0    0
or.andcm  DeMorgan OR/AND   -    -          -    -          1    0          -    -
and.orcm  DeMorgan AND/OR   -    -          0    1          -    -          -    -

A dash (-) indicates that the predicate register is left unchanged.

Normal compare operation is selected by not adding a ctype completer after the crel completer in the compare instruction mnemonic. If the value of the qualifying predicate Pr[qp] is 0 as the instruction is executed, Table 7.3 shows that the state of the predicates is unchanged. If Pr[qp] is set to 1, the predicate registers are updated based on the result produced by the compare. For example, if the result of the compare operation is False (0), the predicate registers are made:
Pr1 = 0; Pr2 = 1
On the other hand, if the result of the comparison operation is True (1), the opposite result is produced:
Pr1 = 1; Pr2 = 0
In this way, we see that the normal type simply writes the result of the compare to the first predicate and the complement to the other. If the NaT bit for either of the source operands is set when the instruction is executed, the result of the compare operation changes. Remember that logic 1 in an NaT bit means that the general register is currently involved in a control speculation execution sequence
and that a deferred exception has occurred. Notice in Table 7.3 that in this case both of the destination predicate registers are cleared. The complete list of compare relationships given in Table 7.2 can only be used with instructions that employ the normal and unconditional comparison types.
Example
Describe how the normal and unconditional compare operations differ.
Solution
Table 7.3 shows that the unconditional type behaves the same as the normal type, except when the qualifying predicate is 0. In this case, the normal compare operation has no effect on the destination predicate registers, but the unconditional compare operation causes the value 0 to be written into both of them. In this way, we see that the result produced by an unconditional compare instruction represents an exception to the predication process.
Example
What results are produced in predicate registers Pr5 and Pr6 if the value in Pr3 equals 1, Gr5 equals FFFFFFFFFFFFFFFFH, and Gr10 equals 0000000000000000H, when the following instruction is executed?
(p3) cmp.ne.unc p5,p6=r5,r10
How does the result change if the value in Pr3 is 0 when the instruction is executed? What results are produced in both cases if NaT5 is logic 1 when the instruction is executed?
Solution
In the first case, the instruction is enabled to execute because its qualifying predicate in Pr3 equals 1. Therefore, the following compare operation is performed:
Gr5 != Gr10
FFFFFFFFFFFFFFFFH != 0000000000000000H
Since the result of the comparison is True, the state of the destination predicate registers is made:
Pr5 = 1; Pr6 = 0
For the second case, the qualifying bit is 0. Therefore, the destination predicate registers are both cleared to 0.
Pr5 = 0; Pr6 = 0
The set NaT bit for the last case indicates that a deferred exception has occurred relative to general register Gr5. Again, both destination predicate registers are cleared to produce this result:
Pr5 = 0; Pr6 = 0
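The three cases of this example can be checked with a small model of the normal and unconditional update rules. This is a sketch in Python under stated assumptions: compare_update is a hypothetical helper written for this illustration, and the starting predicate values of 1 are arbitrary.

```python
def compare_update(ctype, qp, result, nat, p1, p2):
    """Return the new (p1, p2) pair for the normal ("") or "unc" type.

    qp     -- value of the qualifying predicate (0 or 1)
    result -- outcome of the comparison (True or False)
    nat    -- True if the NaT bit of any source operand is set
    p1, p2 -- current values of the destination predicates
    """
    if qp == 0:
        # The normal type leaves the predicates alone; unc clears both.
        return (0, 0) if ctype == "unc" else (p1, p2)
    if nat:
        return (0, 0)
    return (1, 0) if result else (0, 1)

r5, r10 = 0xFFFFFFFFFFFFFFFF, 0x0000000000000000
result = r5 != r10                                         # the .ne relationship

case1 = compare_update("unc", 1, result, False, 1, 1)      # Pr3 = 1
case2 = compare_update("unc", 0, result, False, 1, 1)      # Pr3 = 0
case3 = compare_update("unc", 1, result, True, 1, 1)       # NaT5 = 1
normal_qp0 = compare_update("", 0, result, False, 1, 1)    # normal type, Pr3 = 0
```

The last call shows the one place where the two types differ: with a 0 qualifying predicate the normal type returns the predicates it was given.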
Parallel Comparison Types
The AND, OR and DeMorgan comparison types are called parallel compare types because they update the destination predicate register only for a particular comparison result. This allows multiple simultaneous OR-type or multiple simultaneous AND-type comparisons to specify the same predicate register. The DeMorgan comparison type is just a combination of an OR type to one destination predicate and an AND type to the other destination predicate. The parallel comparison types cannot be used with all of the operand configurations in Table 7.1 and completer relationships in Table 7.2. Table 7.4 summarizes the .crel completer relationships available for use with the parallel comparison types. For example, the AND, OR, and DeMorgan comparison types can be used with instructions that perform either the equal or not equal comparisons between the values in two general registers or between that of a general register and an immediate operand. However, if the comparison relationship represents an inequality comparison, such as less than or greater than, the only operand configuration supported is to compare the value in a general register and that in general register 0. That is, only the third and fourth compare instruction formats in Table 7.1 apply. Notice in Table 7.4 that unsigned forms of the inequality relationships are also not supported for instructions using the parallel type completers.

Table 7.4: Compare Relationship Completers for Parallel Types

crel  Meaning                          Compare relation  Immediate range
eq    Equal                            a == b            –128 .. 127
ne    Not equal                        a != b            –128 .. 127
lt    Less than signed                 0 < b             Immediate form not supported
lt    Less than signed                 a < 0             Immediate form not supported
le    Less than or equal to signed     0 <= b            Immediate form not supported
le    Less than or equal to signed     a <= 0            Immediate form not supported
gt    Greater than signed              0 > b             Immediate form not supported
gt    Greater than signed              a > 0             Immediate form not supported
ge    Greater than or equal to signed  0 >= b            Immediate form not supported
ge    Greater than or equal to signed  a >= 0            Immediate form not supported

An example is the instruction:
(p0) cmp.le.or p5,p6=r0,r10
When this instruction is executed, the value in Gr10 is compared to that in Gr0. Remember that the contents of Gr0 are hardwired as 0000000000000000H. If the value in Gr0 is less than or equal to the signed value in Gr10, which means the comparison is True (1), Table 7.3 shows that the values of both Pr5 and Pr6 are set to 1. On the other hand, if the comparison results in False (0), Pr5 and Pr6 are left unchanged.
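Because a parallel compare writes its predicates for only one comparison outcome, several of them can target the same predicate registers without interfering. The Python sketch below models this behavior. The table PARALLEL_TYPES and the function parallel_update are assumptions made for illustration; the entries for the OR and AND types follow the text above, while the remaining rows reflect one reading of Table 7.3 and should be checked against it.

```python
# ctype -> (comparison result that triggers a write,
#           values written to (p1, p2),
#           whether a set source NaT bit also forces the write)
PARALLEL_TYPES = {
    "or":       (True,  (1, 1), False),
    "orcm":     (False, (1, 1), False),
    "and":      (False, (0, 0), True),
    "andcm":    (True,  (0, 0), True),
    "or.andcm": (True,  (1, 0), False),
    "and.orcm": (False, (0, 1), False),
}

def parallel_update(ctype, result, nat, p1, p2):
    """Return new (p1, p2) for a parallel compare whose qualifying predicate is 1."""
    trigger, written, nat_forces_write = PARALLEL_TYPES[ctype]
    if nat:
        return written if nat_forces_write else (p1, p2)
    return written if result == trigger else (p1, p2)

# (p0) cmp.le.or p5,p6=r0,r10 for two values of Gr10, starting from Pr5 = Pr6 = 0
when_true = parallel_update("or", 0 <= 5, False, 0, 0)     # both predicates set
when_false = parallel_update("or", 0 <= -3, False, 0, 0)   # both left unchanged

# Several OR-type compares accumulating into the same predicates:
# the result is 1 if any tested value equals zero.
p5, p6 = 0, 0
for value in (7, 0, 3):
    p5, p6 = parallel_update("or", value == 0, False, p5, p6)
any_zero = (p5, p6)
```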
Compare Instructions Implemented as a Pseudo-operation
Not all of the compare relationships are directly implemented in the Itanium processor hardware. Those that are not actually provided for in hardware are implemented with pseudo-operations. That is, the compiler replaces the instruction with an alternate instruction that performs the equivalent function by using an implemented relationship. The source operands and the predicate registers may also need to be interchanged to implement the function. Table 7.5 identifies the compare relationships for the normal and unconditional types that are performed as pseudo-operations and tells how the pseudo-op instruction is implemented.

Table 7.5: Compare Relationships Implemented as Pseudo-operations for Normal and Unconditional Types

        Register form pseudo-op         Immediate form pseudo-op
crel    crel  Operands  Predicates      crel  Operands  Predicates
ne      eq    -         p1 ↔ p2         eq    -         p1 ↔ p2
le      lt    a ↔ b     p1 ↔ p2         lt    a-1       -
gt      lt    a ↔ b     -               lt    a-1       p1 ↔ p2
ge      lt    -         p1 ↔ p2         lt    -         p1 ↔ p2
leu     ltu   a ↔ b     p1 ↔ p2         ltu   a-1       -
gtu     ltu   a ↔ b     -               ltu   a-1       p1 ↔ p2
geu     ltu   -         p1 ↔ p2         ltu   -         p1 ↔ p2

A dash (-) indicates that no change is made to the operands or predicates.

For instance, the compiler performs the register compare form of the not equal (.ne) relationship by using the equal completer and interchanging the predicate registers. Therefore, the compiler implements the operation performed by the instruction
(p0) cmp.ne p5,p6=r5,r10
by replacing it with
(p0) cmp.eq p6,p5=r5,r10
Another example of a pseudo-op is the register form of the signed less-than-or-equal-to relationship. It is implemented with an instruction that uses the less-than comparison relation with both the source registers and predicate registers interchanged.
Example
Show what instruction the compiler uses to implement the pseudo-op instruction
(p0) cmp.le.unc p5,p6=r5,r10
Solution
Table 7.5 shows that the .le conditional relationship is replaced by .lt, and that both the source operands and the predicate registers are interchanged. This gives
(p0) cmp.lt.unc p6,p5=r10,r5
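The rewriting rules for the signed and unsigned relationships can be written down as data and applied mechanically. The following Python sketch does this and then checks, over a small range of integers, that each signed pseudo-op produces the same predicate values as its replacement for the normal comparison type. The names REGISTER_FORM, IMMEDIATE_FORM, lower, and evaluate are assumptions made for this illustration, and the check ignores the 8-bit limit on immediate operands.

```python
import operator

# crel -> (implemented crel, operand change, interchange predicates?)
REGISTER_FORM = {
    "ne":  ("eq",  None,   True),
    "le":  ("lt",  "swap", True),
    "gt":  ("lt",  "swap", False),
    "ge":  ("lt",  None,   True),
    "leu": ("ltu", "swap", True),
    "gtu": ("ltu", "swap", False),
    "geu": ("ltu", None,   True),
}
IMMEDIATE_FORM = {
    "ne":  ("eq",  None,  True),
    "le":  ("lt",  "a-1", False),
    "gt":  ("lt",  "a-1", True),
    "ge":  ("lt",  None,  True),
    "leu": ("ltu", "a-1", False),
    "gtu": ("ltu", "a-1", True),
    "geu": ("ltu", None,  True),
}

def lower(crel, a, b, p1, p2, immediate=False):
    """Rewrite a pseudo-op compare as (implemented crel, a, b, p1, p2)."""
    table = IMMEDIATE_FORM if immediate else REGISTER_FORM
    if crel not in table:
        return (crel, a, b, p1, p2)        # already an implemented relationship
    real, operands, swap_predicates = table[crel]
    if operands == "swap":
        a, b = b, a
    elif operands == "a-1":
        a = a - 1
    if swap_predicates:
        p1, p2 = p2, p1
    return (real, a, b, p1, p2)

RELATIONS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
             "le": operator.le, "gt": operator.gt, "ge": operator.ge}

def evaluate(crel, a, b, p1, p2):
    """Normal-type compare on integers: p1 gets the result, p2 its complement."""
    result = RELATIONS[crel](a, b)
    return {p1: int(result), p2: int(not result)}

all_equivalent = all(
    evaluate(crel, a, b, "p5", "p6") == evaluate(*lower(crel, a, b, "p5", "p6", imm))
    for crel in ("ne", "le", "gt", "ge")
    for imm in (False, True)
    for a in range(-3, 4)
    for b in range(-3, 4)
)
```

For example, lower("le", "r5", "r10", "p5", "p6") returns ("lt", "r10", "r5", "p6", "p5"), which corresponds to interchanging both the source registers and the predicate registers.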
Compare Word Instruction
The operation of the compare word (cmp4) instruction is similar to that of the cmp instruction. Looking at Table 7.1, we find that the formats of the compare word instruction are identical to those of the compare instruction. Moreover, the compare relationship and compare type completers used to define the compare operation are the same as those given in Tables 7.2, 7.3, and 7.4. The only difference in this instruction is that it compares the least-significant 32 bits of the source operands, instead of the full 64 bits.
Example
How would the compiler express this pseudo-op instruction?
(p0) cmp4.gt p5,p6=0F1H,r10
What results are produced if the value in Gr10 is FFFFFFFF00000000H when the instruction is executed?
Solution
Table 7.5 shows that to convert the pseudo-op instruction to a real instruction, the compare relationship is changed to .lt, 1 is subtracted from the 8-bit immediate operand, and the predicates are interchanged. This results in
(p0) cmp4.lt p6,p5=0F0H,r10
When the instruction is executed, the immediate operand is first sign-extended to 32 bits to give FFFFFFF0H and then it is compared to the value in the lower 32 bits of Gr10, which is 00000000H. Therefore, the less-than inequality states that
FFFFFFF0H < 00000000H
This inequality is valid because, as a signed number, FFFFFFF0H represents –16. Since the normal comparison type is used and the result is 1, Table 7.3 shows that the first destination predicate of the real instruction, Pr6, is set to 1 and the second, Pr5, is cleared to 0.
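The arithmetic in this example can be reproduced directly. The sketch below, written in Python with illustrative helper names (sign_extend, cmp4_lt_imm, and cmp4_gt_imm are assumptions, not instruction names), sign-extends the 8-bit immediate and compares it with the lower 32 bits of the register. A full 64-bit comparison of the same values is included for contrast.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def cmp4_lt_imm(imm8, reg):
    """Signed imm8 < low 32 bits of reg, as tested by cmp4.lt."""
    return sign_extend(imm8, 8) < sign_extend(reg, 32)

def cmp4_gt_imm(imm8, reg):
    """Signed imm8 > low 32 bits of reg, as tested by the cmp4.gt pseudo-op."""
    return sign_extend(imm8, 8) > sign_extend(reg, 32)

r10 = 0xFFFFFFFF00000000

imm_as_32 = sign_extend(0xF0, 8) & 0xFFFFFFFF   # 0F0H sign-extended to 32 bits
lt_result = cmp4_lt_imm(0xF0, r10)              # -16 < 0
gt_result = cmp4_gt_imm(0xF1, r10)              # -15 > 0

# The same test on all 64 bits sees Gr10 as a large negative number instead.
lt_64bit = sign_extend(0xF0, 8) < sign_extend(r10, 64)
```

Here lt_result is the complement of gt_result, which is why the replacement instruction must interchange the two destination predicates.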
TEST BIT AND TEST NAT BIT INSTRUCTIONS
The last instructions of the compare instruction group shown in Table 7.1 are the test bit (tbit) instruction and test NaT bit (tnat) instruction. They are used to test the logic state of bit-wide data and produce a pair of destination predicates that reflect the result. Let us begin by examining the format and operation of the tbit instruction. The general format for this instruction is given in Table 7.1. This instruction provides the ability to test the value of a specific bit in a source general register, which is denoted in general as r3, and reflect the result of this test with predicate bits p1 and p2. The position of the bit in r3 is specified by the immediate operand pos6. With this 6-bit value any one of the 64 bit positions in r3 can be chosen. A test relationship (trel) completer that is appended to the instruction mnemonic defines the test performed on the selected bit in r3. Table 7.6 shows that there are just two choices for trel, not zero (.nz) or zero (.z), and that they are used to test for logic 1 and logic 0, respectively. Just like for the compare instruction, a comparison type completer selected from Table 7.3 determines how the predicate registers are updated for the True (1) and False (0) results of the test.

Table 7.6: Test Bit and Test NaT Bit Relationships for Normal and Unconditional Type Completers

trel  Test relation
nz    Selected bit == 1
z     Selected bit == 0
An example is the instruction
(p3) tbit.nz.unc p5,p6=r5,0FH
When executed with Pr3 equal to 1, it tests bit 15 in Gr5 for the logic 1 state. If the value of bit 15 is 1, the result is True and produces
Pr5 = 1; Pr6 = 0
On the other hand, if the value of bit 15 is found to be 0, the result is
Pr5 = 0; Pr6 = 1
Remember that if Pr3 is 0 when this instruction is executed the destination predicate registers are both cleared.
The tnat instruction tests the value in the NaT bit corresponding to the general register r3 specified as the source operand. Based on the test operation defined by the .trel completer selected from Table 7.6, the .ctype completer from Table 7.3, and the value of the NaT bit, an appropriate result is produced in the destination predicate registers, p1 and p2. In this way, we see that this instruction can be used to detect the occurrence of a deferred exception and signal this condition by setting or resetting predicate bits.
Example
Describe the operation performed by the following instruction when the value in NaT5 is 0.
(p0) tnat.z.or p5,p6=r5
What does the result mean?
Solution
The instruction tests the state of NaT5 to confirm that it is 0. Since the test relation is satisfied, the destination operands are updated based on the OR type completer. This gives
Pr5 = 1; Pr6 = 1
This result means that a deferred exception does not exist.
The .z test bit relationship is directly implemented in the hardware of the Itanium processor for normal and unconditional type test bit and test NaT bit instructions. The .nz relationship for these two comparison types is implemented as a pseudo-op and is coded by the compiler using the .z completer. To create the instruction, the compiler replaces the .nz completer with the .z completer and interchanges the predicate registers. For instance, our earlier example instruction (p3) tbit.nz.unc p5,p6=r5,0FH is a pseudo-op. This instruction is implemented as (p3) tbit.z.unc p6,p5=r5,0FH
For the parallel comparison types, both the .z and .nz test relations are implemented in hardware.
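The equivalence between the .nz pseudo-op and its .z replacement can be confirmed with a short model. This is a Python sketch for the normal and unconditional types with a qualifying predicate of 1; the functions tbit and normal_write are hypothetical helpers introduced only for this illustration.

```python
def tbit(trel, reg_value, pos6):
    """Test the bit at position pos6 for 1 (trel "nz") or for 0 (trel "z")."""
    bit = (reg_value >> pos6) & 1
    return bit == 1 if trel == "nz" else bit == 0

def normal_write(result, first, second):
    """Normal or unc type with Pr[qp] = 1: first gets the result, second its complement."""
    return {first: int(result), second: int(not result)}

r5_bit_set = 0x8000     # bit 15 is 1
r5_bit_clear = 0x0000   # bit 15 is 0

# tbit.nz.unc p5,p6=r5,0FH  versus  tbit.z.unc p6,p5=r5,0FH
pseudo_set = normal_write(tbit("nz", r5_bit_set, 0x0F), "p5", "p6")
real_set = normal_write(tbit("z", r5_bit_set, 0x0F), "p6", "p5")

pseudo_clear = normal_write(tbit("nz", r5_bit_clear, 0x0F), "p5", "p6")
real_clear = normal_write(tbit("z", r5_bit_clear, 0x0F), "p6", "p5")
```

In both cases the pseudo-op and the replacement leave the same values in Pr5 and Pr6.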
BRANCH INSTRUCTION GROUP
Up to this point in the chapter we have introduced the operation of the compare and test instructions that can set or reset the states of predicate registers, and the use of these predicates to qualify the execution of instructions. Most of the instruction sequences we have considered up to this point have executed sequentially. It is very important in practical application programs to be able to conditionally change the order in which instructions are executed. In this section, we introduce the branch instruction group and the use of branch instructions in implementing conditional branch program structures. The section that follows shows how predication may be used to eliminate conditional branch operations. Branch instructions are also the building block for the Itanium processor’s function call process and loop program structures. These topics are also introduced later in this chapter.
Types of Branch Operations and Instructions
A change in control flow in a program is known as a branch. In the Itanium architecture, this operation is initiated with the conditional form of the branch (br) instruction. The formats of the branch instruction are given in Table 7.7. The branch mnemonic can be appended with a variety of completers to implement the different types of branch operations needed in software applications. Choices for the first completer, branch type (btype), are listed in Table 7.8. The first branch type, which is the one used to specify a conditional branch operation, is designated by adding either .cond or no completer to the mnemonic. Other completers are used to create specialized branch instructions for function call, function return, and counted loop applications.

Table 7.7: Branch Instruction Formats

Mnemonic  Operation           Format
br        Conditional branch  (qp) br.btype.bwh.ph.dh target25
                              (qp) br.btype.bwh.ph.dh b2
          Call branch         (qp) br.btype.bwh.ph.dh b1=target25
                              (qp) br.btype.bwh.ph.dh b1=b2
          Counted branch      br.btype.bwh.ph.dh target25
                              br.ph.dh target25
                              br.ph.dh b2
Table 7.8 shows the branch condition employed by the different forms of the branch instructions, that is, the condition they test to determine whether or not the branch is to take place. For instance, the conditional branch, call branch, and return branch all use the state of a qualifying predicate (qp) as the branch condition. If this predicate is logic 1 when the instruction is executed, the branch in program control is taken; otherwise, the next sequential instruction is executed.

Table 7.8: Branch Type Completers

Mnemonic        Function                       Branch condition                       Target address
cond (or None)  Conditional branch             Qualifying predicate                   IP-rel or Indirect
call            Conditional function call      Qualifying predicate                   IP-rel or Indirect
ret             Conditional function return    Qualifying predicate                   Indirect
cloop           Counted loop branch            Loop count                             IP-rel
ctop, cexit     Modulo-scheduled counted loop  Loop count and epilog count            IP-rel
wtop, wexit     Modulo-scheduled while loop    Qualifying predicate and epilog count  IP-rel
When a branch is taken, the next instruction to be executed is located at the target address specified by the branch instruction. Therefore, the branch instruction must change the value in the instruction pointer so that it points to the storage location in memory of the new instruction. In Table 7.8, we see that there are two ways of specifying the target address of a branch instruction. They are called the IP-relative target address and indirect target address.
The Conditional Branch Instruction
The first instruction format given in Table 7.7 for the conditional branch operation employs IP-relative addressing. This format is written in general as:
(qp) br.btype.bwh.ph.dh target25
Here the target address is specified with a 21-bit signed displacement, which is identified in the instruction format as immediate operand target25. Remember that code is fetched as 16-byte bundles; therefore, the four least-significant bits of IP are ignored. This is why target25 is encoded into the instruction as a 21-bit immediate operand. When the instruction is executed and the branch is to be taken, the value of the immediate operand is shifted left by 4 bits and added to the value of the IP for the bundle containing the branch to give the address of the target bundle. A 21-bit signed displacement, which is shifted left by 4 bits, allows a branch reach of 16 Mbytes in either direction. An example is the instruction
(p5) br.cond label
Assume that the value of the IP is 0000F00000010000H and that the displacement between the address in IP and that of label is 0FFFE0H bytes. The value 0FFFEH is coded into the 21-bit immediate operand field of the branch instruction. When the instruction is executed, the value in predicate register Pr5 is tested. If this value is 0, the next sequential bundle of instructions, which is at address 0000F00000010010H, is fetched from memory. However, if Pr5 equals 1, the value of the immediate operand is shifted left 4 bits to recover the displacement value 0FFFE0H. Then, the address used to access the next instruction bundle in memory is calculated as
0000F00000010000H + 0FFFE0H = 0000F0000010FFE0H
Since the four least-significant bits of the instruction pointer and immediate operand are ignored, target addresses are always bundle aligned.
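The address calculation in this example can be sketched in a few lines of Python. The function names encode_imm21 and branch_target are assumptions made for illustration; the code models only the arithmetic described above, not the actual bit layout of the instruction.

```python
MASK64 = (1 << 64) - 1

def encode_imm21(displacement_bytes):
    """Drop the four low-order bits of a bundle-aligned byte displacement."""
    if displacement_bytes % 16 != 0:
        raise ValueError("displacement must be a multiple of 16 bytes")
    return (displacement_bytes >> 4) & 0x1FFFFF

def branch_target(ip, imm21):
    """Sign-extend the 21-bit field, shift it left 4 bits, and add it to the bundle address."""
    if imm21 & (1 << 20):
        imm21 -= 1 << 21
    bundle_address = ip & ~0xF
    return (bundle_address + (imm21 << 4)) & MASK64

ip = 0x0000F00000010000
imm21 = encode_imm21(0x0FFFE0)       # value coded into the instruction
taken = branch_target(ip, imm21)     # address fetched when Pr5 = 1
not_taken = ip + 16                  # next sequential bundle when Pr5 = 0
backward = branch_target(ip, encode_imm21(-32))   # a negative displacement
```

Because the immediate is signed, the same calculation also covers backward branches, as the last line shows.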
The second format in Table 7.7 describes a conditional branch instruction that employs an indirect target address: (qp) br.btype.bwh.ph.dh b2 In this case, source operand b2 identifies a branch register that holds the value of the target address that is to be loaded into the IP. An example is the instruction (p5) br b10
Branch Hints
The compiler has the ability and resources to analyze the behavior of branch operations in an application program. The compiler communicates information about the usage of the branch instructions to the Itanium processor hardware through three branch hint completers. They are the branch whether hint (.bwh), sequential prefetch hint (.ph), and cache deallocation hint (.dh). Transfer of this knowledge to the processor helps to reduce penalties associated with the branch operation. The branch whether hint is used to suggest to the processor whether or not it should use branch prediction resources to dynamically predict a branch. The available options for .bwh are listed in Table 7.9. For example, specifying static not-taken (.spnt) for .bwh means that the branch is not normally taken and that the processor should not use prediction resources for it. Similarly, the static taken (.sptk) completer tells the processor that the branch is normally taken and again that prediction resources should not be used in support of the branch operation. On the other hand, the dynamic not-taken (.dpnt) and dynamic taken (.dptk) hints indicate to the hardware that it should consider applying prediction resources to the execution of the branch sequence.

Table 7.9: Branch Whether Hint Completers

bwh   Meaning
spnt  Static not-taken
sptk  Static taken
dpnt  Dynamic not-taken
dptk  Dynamic taken
Another way of limiting the impact of branches that are not taken is to limit the amount of code that is prefetched. This is the purpose of the sequential prefetch hint. Table 7.10 shows the allowed options for the .ph completer. Using the completer .few, the compiler tells the processor to prefetch only a few lines of code at the branch target address. On the other hand, the .many completer means that more lines of code should be prefetched.

Table 7.10: Sequential Prefetch Hint Completers

ph           Meaning
few or none  Few lines
many         Many lines
The last hint, branch cache deallocation hint (.dh), is used to tell the processor hardware whether or not the branch is expected to be reused. Table 7.11 lists the allowed values for .dh.

Table 7.11: Branch Cache Deallocation Hint Completers

dh    Meaning
none  Don’t deallocate
clr   Deallocate branch information
Normally, prediction resources keep track of the most recently executed branches. If the compiler knows that a branch will soon be used again, it marks the branch instruction with the don’t deallocate hint, which is specified by not adding a branch cache deallocation hint to the branch mnemonic. Appending the mnemonic with .clr means that the branch information is not expected to be reused in the near term and should be deallocated. In this way, the compiler indicates which code should be maintained in the cache, thereby reducing cache misses.
Example
Describe the branch operation that is performed by the instruction
(p10) br.dptk.many b5
Solution
When predicate register Pr10 contains the value 1, a branch is initiated to the bundle of instructions that is pointed to by the target address in branch register Br5. Hints are provided to suggest to the hardware that the branch is likely to be taken, that prediction resources should be used in support of its execution, and that more instructions should be prefetched and maintained in the cache.
ELIMINATING CONDITIONAL BRANCHES WITH PREDICATION
In Chapter 1, branches in program execution were identified as a major factor in reducing the performance of an application. Remember that predication can potentially be used to eliminate conditional branch operations from a program. One of the most important uses of the Itanium processor’s predication mechanism is optimizing application performance by eliminating conditional branch operations. Using this method, the compiler removes the branch and replaces it with conditional execution of code in parallel. The decision whether or not a branch operation is to be replaced through predication is made by the compiler.
Let us next look at how predication is used to remove a conditional branch operation from a program. Figure 7.1(a) illustrates a typical conditional branch instruction sequence. Notice that a compare instruction followed by a conditional branch instruction is located in the sequential block of code. Based on the result of the comparison of parameters a and b, an abrupt change in program flow may occur. If a correct prediction is made, the work done in preparing instructions for execution can be saved. For instance, if a prediction is made that the result of the comparison is True, the processor would be processing the instructions along this path and have them decoded and waiting in the instruction queue for execution. If the wrong path is chosen, the processor must discard this code by flushing it from the execution pipeline and reload the correct code from memory.
Figure 7.1: Branch Elimination through Predication
To improve the performance of branching, Itanium architecture uses the predication technique. Instead of making a predicted change in program flow at a conditional branch, predication replaces the branch operation with conditionally executed instruction sequences implementing both of its branch paths. Both paths of code are conditionally executed in parallel, but only the results for the path that is to be taken are used to update the application state. It is more efficient for the Itanium processor to execute both paths of code and just ignore the unwanted results. In this way, we see that predication eliminates the flush and refetch penalties that result from branch mispredicts, thereby increasing the performance.
Figure 7.1(b) shows how this conditional branch sequence is implemented using predication. A separate predicate is assigned to the instructions in each path. In the example, these predicates are called p1 and p2 and are part of a compare operation that replaces the comparison and branch instruction sequence in the original conditional branch program. Here p1 corresponds to the predicate register that is used for conditional execution of path 1, which is the sequential flow path in Figure 7.1(a), and p2 corresponds to the predicate register that controls the execution of path 2, which represents the change in program flow path in Figure 7.1(a). The instructions corresponding to both paths are prepared for execution in parallel, but not executed prior to execution of the compare instruction.
The compare instruction in this example is used to decide whether or not the values of variables a and b are equal. When this comparison operation is performed, the result produced by the comparison is used to update the value of p1 and p2. If the comparison is True, which means that a equals b, then p1 and p2 are assigned values of 1 and 0, respectively. If the comparison is False, which means that a is not equal to b, then p1 and p2 are made 0 and 1, respectively. Then, the code for both path 1 and path 2 is executed in parallel. Only the instructions for the path with its predicate set to True will result in the update of the application state. For instance, if p1 is False and p2 True, the results produced by the code qualified by p2 are used. Therefore, the results produced are those corresponding to the change in program flow code sequence.
Figure 7.2(a) is an example of a C coded IF-THEN-ELSE program structure. Here the value in r1 determines which computation is performed. The assembly instruction sequence in Figure 7.2(b) shows how this software function is implemented with a conditional branch operation. If the branch predicts the wrong path, this version can require many more clocks than the predicated version shown in Figure 7.2(c) and is never faster.
The compiler uses predication to produce an optimized replacement for the IF-THEN-ELSE operation. The code in Figure 7.2(c) replaces the conditional branch operation. Here the conditional comparison instruction cmp.ne determines whether or not the value in Gr1 is 0. If Gr1 holds the value 0, the compare relationship is False and predicates are made Pr5 = 0 and Pr6 = 1; otherwise, they are made Pr5 = 1 and Pr6 = 0. The THEN and ELSE computational paths of the C program are performed by conditional add and subtract instructions, respectively. The THEN path is conditional on Pr5 equal to 1 and the ELSE path on Pr6 equal to 1. As the program runs, both the add and subtract instructions are executed, but only the results produced by the instruction with a True qualifying predicate are used. For the case where the comparison relationship is satisfied, Pr5 is 1 and only the add instruction updates the application state.

    if (r1)
        r2 = r3 + r4;
    else
        r7 = r6 - r5;

(a) C coded IF-THEN-ELSE program structure

              cmp.eq   p5,p6=r1,r0       //Cycle 0
        (p5)  br.cond  else_clause       //Cycle 0
              add      r2=r3,r4          //Cycle 1
              br       end_if            //Cycle 1
    else_clause:
              sub      r7=r6,r5          //Cycle 1
    end_if:

(b) Nonoptimized predicted code implementation

              cmp.ne   p5,p6 = r1,r0;;   //Cycle 0
        (p5)  add      r2 = r3,r4        //Cycle 1
        (p6)  sub      r7 = r6,r5        //Cycle 1

(c) Optimized solution employing predication

Figure 7.2: Optimized and Nonoptimized Predication for IF-THEN-ELSE Program Structure

Notice that the compare instruction is terminated by ;; to assure that it is in a separate instruction group from the arithmetic instruction. In this way, the comparison operation is assured to occur before the arithmetic calculation, rather than at the same time.
Even though predication can improve performance by eliminating branch mispredictions, the compiler does not replace all branch operations in an application program with predicated code. For example, a branch operation that has one path that takes much longer to execute than the other may not be replaced using predication. This condition is known as an unbalanced branch. If it turns out that the code for the shorter path is taken more frequently, the execution resources of the processor are not efficiently used because they are always also consumed performing the code for the longer, less frequently used path. In this case, the compiler may elect to leave the conditional branch code in place, but add hints to the instruction to help the processor correctly predict the path.
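The claim that the branching and predicated versions of Figure 7.2 leave the same application state can be checked with a small simulation. The following Python sketch is an illustration only: the functions branch_version and predicated_version and the dictionary of register values are assumptions, and 64-bit wraparound is ignored for brevity.

```python
def branch_version(regs):
    """Control flow of Figure 7.2(b): exactly one path is executed."""
    r = dict(regs)
    if r["r1"] == 0:                  # cmp.eq p5,p6=r1,r0 / (p5) br.cond else_clause
        r["r7"] = r["r6"] - r["r5"]   # else_clause: sub r7=r6,r5
    else:
        r["r2"] = r["r3"] + r["r4"]   # add r2=r3,r4
    return r

def predicated_version(regs):
    """Figure 7.2(c): both instructions are issued, predicates gate the writes."""
    r = dict(regs)
    p5 = int(r["r1"] != 0)            # cmp.ne p5,p6=r1,r0
    p6 = 1 - p5
    add_result = r["r3"] + r["r4"]    # computed regardless of p5
    sub_result = r["r6"] - r["r5"]    # computed regardless of p6
    if p5:
        r["r2"] = add_result          # (p5) add r2=r3,r4
    if p6:
        r["r7"] = sub_result          # (p6) sub r7=r6,r5
    return r

start = {"r1": 0, "r2": 0, "r3": 10, "r4": 20, "r5": 3, "r6": 8, "r7": 0}

else_taken = predicated_version(start)                 # r1 == 0: only r7 is updated
then_taken = predicated_version({**start, "r1": 1})    # r1 != 0: only r2 is updated
```

In the predicated version both results are computed every time, which mirrors the cost discussed above for unbalanced branches.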
Chapter 7 - Compare and Branch Instructions Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
OPTIMIZING CODE WITH THE PARALLEL COMPARE OPERATION
The previous section demonstrated use of predication to eliminate a single branch operation. A program sequence might involve a more complex relational expression that requires multiple compares and branches. Figure 7.3(a) provides a sample C program structure. The conditional relationship defined by the C statement is the expression:
((A = B) AND (B = C) AND (C = D) AND (D = E))
The operation performed by this relationship is called a parallel compare. That is, the result is True only if each individual comparison is true. The compiler could produce the assembly language sequence in Figure 7.3(b) to implement this conditional expression. Here the compiler has used normal compare instructions that can only compare two numbers at a time. For instance, the first cmp.ne instruction compares the values in Gr2 and Gr3 and if they are not equal makes Pr1 True. If any of these conditional tests are True, a branch is taken that skips the computation of the result in Gr7. Notice that it takes three clocks, after the values of A, B, C, D, and E are known, to compute the predicates, to determine whether the result is to be computed, and to calculate the value of the result.

        if ((r2 == r3) && (r3 == r4) && (r4 == r5) && (r5 == r6))
            r7 = r8 + r9;

(a) C coded parallel compare program structure

                cmp.ne   p1,p0=r2,r3        // cycle 0
                cmp.ne   p2,p0=r3,r4
        (p1)    br.cond  skip
        (p2)    br.cond  skip
                cmp.ne   p3,p0=r4,r5        // cycle 1†
                cmp.ne   p4,p0=r5,r6
        (p3)    br.cond  skip
        (p4)    br.cond  skip ;;
                add      r7=r8,r9           // cycle 2†
        skip:

(b) Nonoptimized assembly code implementation

                cmp.eq      p1,p0=r0,r0;;   // initialize p1 to 1‡
                cmp.eq.and  p1,p0=r2,r3     // cycle 0
                cmp.eq.and  p1,p0=r3,r4
                cmp.eq.and  p1,p0=r4,r5
                cmp.eq.and  p1,p0=r5,r6 ;;
        (p1)    add         r7=r8,r9        // cycle 1

(c) Optimized implementation

Figure 7.3: Optimizing Code with the Parallel Compare Operation
Notes:
† plus any mis-predicted branch penalty time
‡ Initialization not counted in time here; it is not on a critical path (can happen any time earlier).

Using the parallel compare capability of the instruction set, the compiler can optimize the program sequence for this parallel compare expression. The special parallel compare instructions come in versions that either write True or do nothing, or versions that either write False or do nothing. Figure 7.3(c) shows the instruction sequence optimized using parallel compare operations. For the computation of ((A=B) AND (B=C) AND (C=D) AND (D=E)), you first set a predicate register, such as Pr1, to True. This initialization operation can be done at any time in conjunction with other operations. Therefore, it does not add a clock to the critical path. Once the values of A, B, C, D, and E are known, the relationships (A=B), (B=C), (C=D), and (D=E) are each separately tested with the cmp.eq.and instruction. These parallel compare instructions all use Pr1 as their result and are all computed in clock 0. If the two values tested by any one of them are not equal, the compare relationship is not satisfied and the predicate is set to False. Otherwise, it is left unchanged (True).
Notice that the example uses three optimizations to improve performance:
    All branch operations have been removed.
    The number of clocks required to perform the operation is reduced from 3 to 2.
    More flexibility has been introduced for scheduling the instructions in parallel.
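The "write False or do nothing" behavior of the and-type parallel compare can be modeled in a few lines of Python. The sketch below is a simplified illustration; the function names cmp_eq_and and all_equal_parallel are hypothetical, and the shuffled evaluation order is there only to show that the four compares do not depend on one another.

```python
def cmp_eq_and(p, a, b):
    """Model the and-type compare: clear p when a != b, else leave p unchanged."""
    return False if a != b else p


def all_equal_parallel(r2, r3, r4, r5, r6):
    """Evaluate (r2==r3) && (r3==r4) && (r4==r5) && (r5==r6) without branches."""
    p1 = True  # initialization; can be done any time before the compares
    # The compares are independent, so any order (or all at once) gives the
    # same final predicate. A deliberately scrambled order is used here.
    for a, b in [(r5, r6), (r2, r3), (r4, r5), (r3, r4)]:
        p1 = cmp_eq_and(p1, a, b)
    return p1
```

Because each compare can only clear the predicate, none of them needs to see the result of another, which is what allows the hardware to issue all four in the same clock.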
REGISTER STACK AND THE REGISTER STACK ENGINE
Increased use of modular programming is driving the need for faster switching of the function call process. The increased size and versatility of the general register file in the Itanium processor are key elements in reducing the overhead of the function call process. Here we will examine the elements of the function call mechanism: register stack, register stack engine, function support instructions, and the function call process.
Organization of the Register Stack
Remember that registers Gr32 through Gr127 of the general register file represent the stacked general registers. Each function allocates its own group of dedicated registers in the stacked general registers. This group of registers is called the register stack frame of the function. The registers of the stack frame are the only stacked registers that are visible to the function. They are temporary storage locations that hold the operands processed during the function and parameters that are input to and output from the function.
Figure 7.4 illustrates the organization of stack frames in the general register file. Notice that function A creates a register stack frame that begins at Gr32. The initial size of this stack frame can range from 0 to 96 registers. The registers of the stack frame are further partitioned into two groups called local registers and output registers. When a stack frame is formed, the size of the local area is 0 and the size of the output area is equal to the size of the calling function's output area. The sizes of the local and output areas are variable and are adjusted through software.
Figure 7.4: Organization of the Register Stack

Remember that the size information that defines the active stack frame is held in the current frame marker (CFM) register. The content of three fields in this register, size of frame (sof), size of locals (sol), and size of rotating (sor), describes the organization of the current stack frame. The variable size of the stack frame is an important feature of the IA-64 function call architecture. The stack frame can be resized using the allocate stack frame (alloc) instruction. Table 7.12 shows that the format of this instruction is:

        alloc r1=ar.pfs,i,l,o,r

Table 7.12: Register Stack Instructions

        Mnemonic    Operation               Format
        alloc       Allocate stack frame    alloc r1 = ar.pfs,i,l,o,r
        flushrs     Flush register stack    flushrs

The meanings of the four parameters used to specify the size of the frame are:

        i = size of inputs immediate
        l = size of locals immediate
        o = size of outputs immediate
        r = size of rotating (a multiple of 8)

When this instruction is executed, these parameters are used to form sol, sof, and sor. The values for the sof, sol, and sor fields are calculated as:

        sof = i + l + o
        sol = i + l
        sor = r
        where: sor ≤ sof

Then, sof, sol, and sor are loaded into their corresponding fields of the CFM register. In this way, a new stack frame is allocated in the register stack. The size of inputs parameter has no separate effect on the frame; only the sum i + l enters sol and sof. It is provided as a convenience to the assembly language programmer and affects the number and definitions of IN (input) and LOC (local) register pseudonyms. An allocate stack frame instruction may be included at the beginning of the function to adjust the size of the stack frame to meet its needs. This ensures that registers are not wasted.
Another operation is performed during the execution of the allocate stack frame instruction. The previous content of the previous function state (PFS) register is copied to the general register selected by destination operand r1. The PFS is application register 64 and is denoted as the source operand ar.pfs. This information is needed to preserve linkage to the prior stack frame during a nested function call.
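The frame-size arithmetic performed for alloc can be checked with a short Python sketch. The function name alloc_frame and the returned dictionary are assumptions made for illustration; the limit of 96 registers comes from the size of the stacked register area, and the multiple-of-8 check on the rotating size reflects the architectural requirement rather than anything stated in Table 7.12.

```python
def alloc_frame(i, l, o, r):
    """Compute the CFM size fields for: alloc r1 = ar.pfs, i, l, o, r."""
    sof = i + l + o          # size of frame: inputs + locals + outputs
    sol = i + l              # size of locals: inputs + locals
    sor = r                  # size of rotating, as described in the text
    if sof > 96:
        raise ValueError("frame exceeds the 96 stacked registers")
    if r % 8 != 0 or sor > sof:
        raise ValueError("rotating size must be a multiple of 8 and fit in the frame")
    return {"sof": sof, "sol": sol, "sor": sor}
```

For example, alloc_frame(7, 7, 7, 16) gives sol = 14 and sof = 21, and alloc_frame(7, 8, 4, 16) gives sol = 15 and sof = 19.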
Parameter Passing
In practical applications, function calls are nested. That is, a new function is called from within an older function. In this case, the prior function is referred to as the caller and the new function that is to be initiated is known as the callee. This notation will be used in the description of the program control switching process.
During the execution of nested functions, the values of variables must be passed between the caller and callee. This is known as parameter passing and is another important element of the function call process. For instance, function B in Figure 7.4 is an example of a nested function. That is, function B is called from within function A. In most applications, function B needs to process information that was created in function A. Therefore, the values of these variables must be output by function A to function B and are considered as inputs by function B.
The registers in the local area of the stack frame are used to hold values that remain local to the functions. These values are said to be local because they are both created and processed within the function. That is, their initial values are not received from another function and their final values are not sent to another function.
Parameter passing is the role of the output register area. This is the reason the output registers of function A overlay the stack frame of function B in Figure 7.4. When function B is initiated, a stack frame is created that is equal in size to and overlays the output register area of function A. Therefore, function A passes variables to function B by simply putting them in its output area. For this reason, this part of function B's stack frame can be viewed as an input area. Figure 7.5(a) shows this organization of a stack frame.
Figure 7.5: Treatment of Functions

Parameter passing also requires that function results can be returned from function B to function A. Function B can return values to function A by placing them in the part of its register stack corresponding to function A's output registers. When the stack frame for function B is resized, a new output area is created that is dedicated to function B. Figure 7.5(b) shows the new stack frame configuration. Function B's output area is used to pass parameters to a function, such as C, that is invoked from within function B.
Register Stack Engine
The description of the register stack leads us to the question, "What happens if multiple nested functions occur and all of the stacked registers get consumed?" Resolving this problem is the role of the register stack engine (RSE). When a new function is called and there are not enough registers left in the register stack for its stack frame to be allocated, the RSE automatically spills stack frame information to memory to free up space. This stack frame information is temporarily saved in a part of memory known as the register stack backing store.
Figure 7.6(a) illustrates how the RSE manages the register stack and register stack backing store. The register stack in the general register file is shown to hold stack frames for three functions: A, B, and C. Function C is currently being executed. For this reason, the portion of the register stack used by function C is identified as the current stack frame.
Figure 7.6: Register Stack Engine Operation

The function that called function C is identified as B. Function B's stack frame is marked as "dirty," meaning that function B is not currently being executed, but its stack frame has not yet been copied to the backing store in memory. If the RSE needs to free up additional registers in the register stack, the content of this stack frame will be written to memory. Once this is done, the block of registers used by function B will be marked as clean and will be free for use by the RSE for a newly initiated function.
The part of the register stack used by function A is identified as clean. This function is also not being executed; however, the contents of its stack frame are already spilled to memory. Notice that a copy of stack frame A is held in the register stack backing store. For this reason, the A group of registers in the register stack is available for reuse by the RSE.
The RSE backing store pointer for memory stores (BSPSTORE) register holds the address of the location in memory to which the RSE will spill the next stack frame. Figure 7.6(a) shows that this is the group of memory storage locations in the register stack backing store just above those holding the copy of function A's stack frame. The stack frame for function B will be saved in this location if it gets spilled to memory. The address in the RSE backing store pointer (BSP) register points to the first storage location in the register stack backing store reserved for the current function. For the state of the backing store in Figure 7.6(a), the address in BSP therefore lies above the address in BSPSTORE by the space reserved for function B's dirty stack frame. Notice that the register stack backing store is not limited in size and grows from lower to higher addresses in memory.
The register stack configuration (RSC) register configures the operation of the RSE. The bit fields of this register were examined in Chapter 3. They set the mode of operation; assign a privilege level to RSE operation; select an organization for RSE data transfers; and synchronize loading of the RSE. For instance, the code in bits 0 and 1, which together represent mode, is used to select one of four modes of operation for the RSE. Using mode, the aggressiveness with which the RSE pursues saves and restores of stack register frames can be set to range from eager, which means that it very aggressively pursues both loads and saves, to lazy.
Another question that must be asked is "What does the RSE do with the NaT bit associated with the registers when a stack frame is spilled to the backing store in memory?" The NaT bits are saved in the RSE NaT collection (RNAT) register.
If functions C and B in Figure 7.6(a) execute to completion, the stack frame for function A must be restored to the register stack before the function can be resumed. This restore operation is also automatically performed by the RSE.
The dirty stack frames in the register stack can also be cleaned to make room for new stack frames under software control. The flush register stack instruction performs this operation. Table 7.12 shows that the format of this instruction is

        flushrs

When this instruction is executed, processing stops until the contents of the dirty portion of the register stack are written to the backing store in memory.
For the state of the register stack in Figure 7.6(a), execution of a flush register stack instruction would spill the contents of function B's stack frame to memory. Figure 7.6(b) shows the new state of the stack. At completion of the flush operation, the addresses in BSPSTORE and BSP are the same.
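The clean, dirty, and current states and the two backing store pointers can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for the sketch: the class and method names are hypothetical, each frame is reduced to the number of registers it preserves across a call, every register occupies 8 bytes in the backing store, and the extra storage used for NaT collections is ignored.

```python
WORD = 8  # bytes written to the backing store per stacked register


class RegisterStackModel:
    def __init__(self, base_address):
        self.frames = []              # [name, preserved_registers, state], oldest first
        self.bspstore = base_address  # where the next spilled register will be written
        self.bsp = base_address       # backing store home of the current frame

    def enter(self, name, preserved_registers):
        """Start a new function; its caller's frame becomes dirty."""
        if self.frames:
            caller = self.frames[-1]
            caller[2] = "dirty"
            self.bsp += caller[1] * WORD
        self.frames.append([name, preserved_registers, "current"])

    def spill_oldest_dirty(self):
        """RSE action: copy the oldest dirty frame to memory and mark it clean."""
        for frame in self.frames:
            if frame[2] == "dirty":
                frame[2] = "clean"
                self.bspstore += frame[1] * WORD
                return True
        return False

    def flushrs(self):
        """Model flushrs: spill until no dirty frames remain."""
        while self.spill_oldest_dirty():
            pass

    def states(self):
        return {name: state for name, _, state in self.frames}
```

Entering A, B, and C and letting the model spill one frame leaves A clean, B dirty, and C current; calling flushrs() afterwards marks B clean and advances BSPSTORE until it equals BSP.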
CONDITIONAL BRANCH FUNCTION SUPPORT INSTRUCTIONS
The last section introduced the organization of the register stack, register stack backing store, and the role of the RSE. Here the description of the Itanium processor's function call mechanism continues with the structure of a function call instruction sequence and its operation.
Structure of a Function Call
Figure 7.7 outlines the structure of a typical function call program sequence. The function call program structure has four sections. The first element is the instruction that calls for the operation of the function. This instruction is located either in the main body of the program or within another function.

                .                               // Main program or
                .                               // old function
                .
        (p0)    br.call b0=new_func             //Function call
                .
                .                               //Other instructions
                .
        new_func:
                alloc   r32=ar.pfs,0,10,5,8     //Resize stack frame
                inst_1 ;;
                inst_2
                .
                .                               //Function body
                .
                inst_n
                br.ret  b0                      //Return

Figure 7.7: Structure of a Function

The instruction that is used to initiate a function is the conditional function call (br.call) version of the branch instruction. Table 7.7 shows that the general formats of the conditional function call instruction are

        (qp) br.btype.bwh.ph.dh b1=target25
        (qp) br.btype.bwh.ph.dh b1=b2

When using these formats, the branch type is always specified with the .call completer. The first format uses IP-relative addressing to specify the starting address of the function. That is, the processor calculates the branch-to address of the bundle that contains the first instruction of the function by adding the immediate operand target25 to the current value of the instruction pointer. In Figure 7.7, the call instruction written using this format is:

        (p0) br.call b0=new_func

This call process is made unconditional by specifying Pr0 as the qualifying predicate. Since the function will always be called, no hint completers are appended to the mnemonic. As part of the change in program control that takes place when this instruction is executed, the return address for the function is saved in branch register Br0, which is specified as the destination operand.
The second format of the br.call uses indirect addressing to specify the branch-to address. The starting address of the function resides in the branch register identified by source branch register b2 and the return address is saved in destination branch register b1. An example is the instruction

        (p0) br.call b5=b6

The first operation that must be performed within the function is to resize the stack frame. Remember that resizing of the stack frame is the job of the allocate stack frame instruction. For this reason, an alloc instruction is included as the first instruction in the function. The alloc instruction must be located as the first instruction in an instruction group.
The sequence of instructions that follows alloc represents the body of the function.
The first instruction is marked as the beginning of an instruction group. When executed, this group of instructions performs the software operation that the function is written to implement.
The last step in the function call process is to initiate a return to the main body of the program when the software operation performed by the routine is complete. As shown in Figure 7.7, the return operation is initiated with a branch return (br.ret) type of conditional branch instruction. The general format of the return instruction is

        (qp) br.btype.bwh.ph.dh b1

Remember that the return address was saved during the call of the function in a branch register specified as the destination operand in the call instruction. For this reason, the register specified by b1 in the br.ret instruction must be the same as that used for the destination operand in the associated call instruction. Therefore, the branch return instruction required in the example is

        (p0) br.ret b0

Predication can be used to conditionally initiate one of two functions. The program structure in Figure 7.8 illustrates this type of program structure. Here a comparison operation determines if the values in Gr3 and Gr4 are equal, and produces predicate bits Pr1 and Pr2 based on the result of this comparison. If these values are equal, the predicates are made Pr1 = 1 and Pr2 = 0, and the function corresponding to the first branch instruction is initiated. Notice that the compiler has appended this branch mnemonic with the .spnt hint. By using this hint, the compiler tells the prediction mechanism of the Itanium processor that, based on an analysis of the program, it believes that this branch normally will not be taken.
On the other hand, the function call for the other condition, when the value in Gr3 is not equal to that in Gr4, is initiated for the conditional result Pr1 = 0 and Pr2 = 1. The compiler believes that this is the function that will normally be initiated. For this reason, the mnemonic for the function call instruction qualified by Pr2 is appended with the .sptk hint.
Figure 7.8: Conditional Execution of Function Calls

These two function calls must be in the same bundle; otherwise, the behavior is undefined because foo could change the value of p2.
Function Call Process
Let us assume that another function is called from within the current function. This initiates a switch in program control from the caller function to the callee function. As part of the transition process, execution of the original function is stopped; program context is saved to permit a return to the original function; a new stack frame is created for the new function; and then the new function is started.
The instruction sequence in Figure 7.9 represents nested functions A and B. Assuming that function A is currently active, let's trace through the sequence of events that take place as function B is invoked and executed to completion. The uppermost row in Figure 7.10, which is marked "A frame" under the instruction column, shows the state of the register stack, CFM register, and PFM part of the PFS register when function A is active. The content of the CFM shows that the stack frame is sized with sol and sof equal to 14 and 21, respectively. As shown in the diagram, stack frame A is defined with the 14 registers ranging from Gr32 through Gr45 as the input/local area and an output area with 7 registers, Gr46 through Gr52. The contents of the PFM are marked as "don't care" states for our example. The values of sol and sof in the PFM correspond to the function before A, which is referred to as an ancestor of A.

        (p0)    br.call b0=func_a               //Call of function A
                .
                .
                .
        func_a: alloc   r1=ar.pfs, 7,7,7,16     //Resize stack frame for A
                inst_A1 ;;
                inst_A2
                .
                .                               //Body of function A
                .
        (p0)    br.call b1=func_b               //Call of function B
                .
                inst_Am
        (p0)    br.ret  b0                      //Return for function A
                .
                .
        func_b: alloc   r2=ar.pfs, 7,8,4,16     //Resize stack frame for B
                inst_B1 ;;
                inst_B2
                .
                .                               //Body of function B
                inst_Bn
        (p0)    br.ret  b1                      //Return for function B

Figure 7.9: Nested Functions A and B

The first step in the control change sequence is initiated by the execution of the br.call instruction that calls func_b in Figure 7.9. When this instruction is executed, the contents of the CFM, the value in the epilog counter (EC) application register, and the current privilege level are saved in the PFS register. The initial stack frame for function B is illustrated by the second row from the top, which is marked "B frame" under the instruction column in Figure 7.10. Notice that the CFM now holds sol equal to 0 and sof equal to 7. These values form an initial stack frame for function B that has no input/local registers and an output area with seven registers.
Figure 7.10: Stack Frame Transitions for Call and Return of Function B

As shown, the new stack frame created for function B is equal in size to and overlays the output area of function A. Several other important changes have taken place in the register stack. First, the sliding window of the stacked registers has resulted in the stacked registers being renamed, such that the first register in A's output area, which is actually register Gr46, becomes register Gr32 for function B. This renaming is carried out by the register stack hardware; the register rename base fields of the CFM, such as rrb.gr, serve the rotating registers and are cleared to 0 by the call. Second, the size information for A's stack frame has been saved in the PFM. Finally, a return link address that points to the instruction bundle following that containing the call instruction for function B is placed in Br1.
The first instruction of function B in Figure 7.9 is the allocate stack frame instruction that resizes stack frame B. When this instruction is executed, the values of sol and sof in the CFM are adjusted to 15 and 19, respectively. Figure 7.10 shows that stack frame B expands such that its input area is aligned with the output area of function A, and it creates a new local area with eight registers and an output area with four registers. Notice that the content of the PFM still represents the size of the stack frame for function A. When the allocate instruction is executed, the complete contents of the PFS are saved in Gr2.
With the register stack in this state, function A's general registers Gr32 through Gr45 are classified as dirty and registers Gr46 through Gr64 (now renamed Gr32 through Gr50) represent the current stack frame. There are no further changes in the organization of the stack as the instructions in the body of function B, which correspond to inst_B1 through inst_Bn in Figure 7.9, are executed.
When the return instruction for function B in Figure 7.9 is executed, program control is returned to the instruction bundle following the instruction in function A that called function B. First, the values of the CFM, EC, and current privilege level for function A are restored from the PFS. Figure 7.10 shows that the CFM again holds sol equal to 14 and sof equal to 21, and the registers are renumbered so the stack frame is restored to the original configuration for function A. Also, the return address saved in Br1 is loaded into the instruction pointer. In this way, execution in function A resumes with the instruction following the call instruction for function B in Figure 7.9.
The contents of the static general registers are not automatically saved or restored as part of the function call and return process. If it is necessary to preserve information in them, software must be included at the function boundaries of the caller and callee to save them in memory and restore them when the caller function is resumed. The register assignments to the caller and callee are determined by the standard linkage convention of the compiler. The overhead of saving and restoring the contents of the stack frame by the RSE and the static registers by software may have an adverse impact on application performance.
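The frame transitions traced for the call and return of function B can be reproduced with a small Python model. It is a sketch under several assumptions: the class FrameState and its attribute bof (the physical number of the register currently named Gr32) are names invented here, only one level of call is tracked, the rotating size is omitted, and wraparound of the 96 stacked registers is ignored.

```python
class FrameState:
    def __init__(self, sol, sof):
        self.bof = 32                        # physical register currently named Gr32
        self.cfm = {"sol": sol, "sof": sof}  # current frame marker (sizes only)
        self.pfm = None                      # previous frame marker held in the PFS

    def br_call(self):
        """Save the caller's frame marker and start the callee on the caller's outputs."""
        self.pfm = dict(self.cfm)
        self.bof += self.cfm["sol"]          # callee's Gr32 = caller's first output
        self.cfm = {"sol": 0, "sof": self.cfm["sof"] - self.cfm["sol"]}

    def alloc(self, i, l, o):
        """Resize the current frame; the PFM is left unchanged."""
        self.cfm = {"sol": i + l, "sof": i + l + o}

    def br_ret(self):
        """Restore the caller's frame marker and register naming."""
        self.cfm = dict(self.pfm)
        self.bof -= self.cfm["sol"]

    def physical(self, virtual):
        """Translate a register name Gr32 and above to its physical number."""
        return self.bof + (virtual - 32)
```

Starting from FrameState(sol=14, sof=21), br_call() yields sol = 0 and sof = 7 with Gr32 mapped to physical register 46, alloc(7, 8, 4) yields sol = 15 and sof = 19, and br_ret() restores sol = 14 and sof = 21.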
LOOP PROGRAM STRUCTURES AND LOOP SUPPORT BRANCH INSTRUCTIONS
In Chapter 1, the loop operation was introduced and identified as a program structure that can impact performance of applications. The loop operation is used to perform a repeated function. Some examples are: to clear a series of storage locations in memory, copy a block of data from one part of memory to another, search for a specific piece of information in an area of memory, or sort a table of information located in memory. Loops can be categorized into two types, based on how the termination condition for the repeating operation is determined. These types are the counted loop and while loop. The termination condition for a counted loop is simply the value of a loop counter. For a while loop, termination is based on a more complicated condition; for instance, the result of a comparison. The Itanium architecture provides special registers, instructions, and a technique known as software pipelining to implement and perform loop operations more efficiently.
Nonpipelined While Loop
Figure 7.11 represents the structure of a typical nonpipelined while loop routine. This loop operation is implemented with a compare instruction to evaluate the termination condition and a qualified conditional branch instruction to perform the branch operation. A while loop operation can be used to search for a value in memory or sort a table of information. For instance, an array containing numbers might be sorted into ascending numerical order.

        loop1_start:
                inst_1                      //first and return instruction of while loop
                inst_2
                .
                .                           //body of while loop
                .
                inst_n
                cmp.gt  p1,p2 = r2,r3       //conditional test for loop
        (p2)    br      loop1_start;;       //while loop branch instruction
                inst_n+3

Figure 7.11: Nonpipelined While Loop Program Structure
Notice that the first instruction of the while instruction sequence is marked with the label loop1_start. Instructions 1 through n are the repeating body of the routine, and may access information in memory, perform computations on this data, and produce results in general registers or memory.
For example, the routine could be used to locate the first value in a range of storage locations in memory that is greater than a given value. In this example, the test value would need to be loaded into a register, for instance, Gr3, prior to initiating the loop operation. The main body of the program reads one value after another from memory into a general register, such as Gr2. After each number is read, a conditional comparison operation is performed to determine if the value in register Gr2 is greater than the test value in Gr3. This comparison is done with a cmp.gt instruction. If the result is False (0), the predicates are made Pr1 = 0 and Pr2 = 1, the branch to loop1_start is taken, and the search operation repeats for the next number in memory. When a value is found that is greater than the value in Gr3, the comparison is True and the predicate results are Pr1 = 1 and Pr2 = 0. This case represents the completion of the loop operation, so the branch is not taken.
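The search just described can be expressed as a short Python sketch. The names cmp_gt and find_first_greater are hypothetical, memory is modeled as a Python list, and, like the assembly loop, the sketch assumes that a qualifying value exists; it does not guard against running past the end of the data.

```python
def cmp_gt(a, b):
    """Model cmp.gt p1,p2 = a,b: return the complementary predicate pair."""
    result = a > b
    return result, not result


def find_first_greater(memory, test_value):
    """Return the index of the first element greater than test_value."""
    index = 0
    while True:
        r2 = memory[index]                # loop body: read the next value
        index += 1
        p1, p2 = cmp_gt(r2, test_value)   # conditional test for the loop
        if not p2:                        # (p2) branch not taken: loop complete
            return index - 1
```

For the list [3, 7, 2, 9, 11] and a test value of 8, the branch is taken three times and the loop exits on the fourth value, giving index 3.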
The while loop program sequence in Figure 7.12 searches through a table of information in memory that starts at address block_start_address, looking for the first value that matches the test value, search_pattern. When a while loop operation is performed, the branch is normally taken. Therefore, the short instruction sequence that performs the loop is repeatedly executed. For these reasons, the br instruction is appended with the branch hints: static taken (.sptk), prefetch many (.many), and don't deallocate. These hints further define the branch operation for the processor. For instance, not deallocating the instructions from the cache results in improved performance.

                movl    r5 = search_pattern         //initialize pattern
                movl    r6 = block_start_address    //initialize start address
        search: ld8     r7 = [r6],8H                //load quad word from memory
                cmp.ne  p5,p6 = r5,r7               //conditional test for loop
        (p5)    br.sptk.many search;;               //repeat search or pattern found

Figure 7.12: Search for a Matching Quad Word in a Block of Memory
Nonpipelined Counted Loop
The Itanium architecture contains special resources for improving the performance of nonpipelined counted loop operations. These resources are the counted loop branch instruction and the loop count (LC) register. As shown in Table 7.8, the counted loop branch (.cloop) type completer is appended to the branch mnemonic to specify the counted loop branch instruction. Table 7.7 shows that the general format of this instruction is

        br.btype.bwh.ph.dh target25

A typical example of this instruction is

        br.cloop.sptk.many start_loop

This instruction does not have a branch predicate. Instead, the decision whether or not to branch is based on the value in the LC register, which is application register 65. The LC is a 64-bit register and must be loaded with the initial value of the loop count prior to entering the body of the loop. The value of the loop count equals one less than the number of iterations performed by the loop routine. If the value in the LC is greater than zero when the br.cloop instruction is executed, the LC is decremented and the br.cloop branch is taken. If the count equals 0, the loop is complete and execution falls through to the instruction following the br.cloop instruction.
The branch operation that is performed by br.cloop employs IP-relative addressing. That is, the value of the branch-to-target address is obtained by adding immediate operand start_loop to the current value of the instruction pointer.
The instruction sequence in Figure 7.13 represents a typical counted loop program structure. Notice that before entering the body of the loop a move long immediate instruction is used to load the initial count into Gr5, and then this value is copied into the loop count application register. The loop instruction sequence starts with the instruction marked with the label loop2_start and ends at the br.cloop instruction. When this loop sequence is run, the body of the routine is repeated a number of times equal to init_count + 1. Each time the br.cloop instruction is encountered, the count in the LC is tested for zero status. If the LC is nonzero, it is decremented by 1, program control is returned to the instruction pointed to by immediate operand loop2_start, and the instructions in the body of the loop are repeated. When the LC is found to be 0, the loop is complete and instruction inst_n+2 is executed.

                movl    r5 = init_count     //initialize LC with initial count
                mov     ar.lc = r5
        loop2_start:
                inst_1                      //first and return instruction of counted loop
                inst_2
                .
                .                           //body of counted loop
                .
                inst_n
                br.cloop loop2_start;;      //counted loop branch instruction
                inst_n+2

Figure 7.13: Nonpipelined Counted Loop Program Structure
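The behavior of br.cloop described above can be checked with a small model. The following Python sketch is only an illustration, not Itanium code: the function name run_counted_loop and the body callback are hypothetical, and the model captures nothing but the LC test, decrement, and branch decision.

```python
def run_counted_loop(init_count, body):
    """Model of the counted loop in Figure 7.13; returns how many times body ran."""
    lc = init_count              # mov ar.lc = r5
    iterations = 0
    while True:
        body(iterations)         # inst_1 ... inst_n
        iterations += 1
        # br.cloop: if LC is nonzero, decrement it and branch back;
        # if LC is zero, fall through to the instruction after the loop.
        if lc != 0:
            lc -= 1
            continue
        break
    return iterations


if __name__ == "__main__":
    visited = []
    count = run_counted_loop(3, visited.append)
    print(count, visited)        # the body runs init_count + 1 times
```

With an initial count of 3 the body runs four times, which is why the LC must be loaded with one less than the number of iterations wanted.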
The instruction sequence in Figure 7.14 represents a simple counted loop application. When this instruction sequence is run, the block of quad word memory storage locations ranging from memory address block_start_address through the ending address calculated as block_start_address + (block_size*8-1) is cleared.
        movl r5 = block_size-1             //initialize LC with count
        mov ar.lc = r5
        movl r6 = block_start_address      //initialize starting address
clear:  st8 [r6] = r0,8H                   //clear memory location
        br.cloop.sptk.few clear;;          //test for repeat or done
Figure 7.14: Clearing a Block of Memory
Pipelined Implementation of Loops
The compiler has the ability to optimize the performance of counted and while loop operations by implementing them as software pipelined loops. The individual iterations of the code in the nonpipelined loop are performed sequentially one after the other as shown in Figure 7.15 (a). The compiler restructures the loop routine so that the individual iterations of the loop are executed in an overlapping fashion. Figure 7.15 (b) illustrates the approach. Not all loops can be implemented in this way. Use of pipelined loop techniques depends on the existence of a match between those execution resources needed to perform the loop routine and those available in the Itanium processor. When pipelining can be applied to a loop operation, the impact of added parallel execution of instructions is improved performance for the application.
Figure 7.15: Nonpipelined versus Pipelined Loop Execution
To structure a loop for pipelining, the instructions of the loop sequence are partitioned into stages. Special resources are provided in the architecture for implementing modulo-scheduled software pipelined loops. These resources are special loop branch instruction types, the EC and LC application registers, and rotating registers.
Table 7.8 identifies the two kinds of modulo-scheduled loop types. The completers .ctop and .cexit are provided for implementing pipelined counted loops. The .ctop completer is used when the terminate decision is made at the bottom of the loop instruction sequence and .cexit if the decision is made elsewhere in the loop routine. The pipelined loop is terminated by a branch condition defined by the two counters, the LC and EC. Again, the initial value in LC equals 1 less than the number of iterations performed by the loop routine. The initial value of the EC equals the number of pipelined iterations it takes to complete one iteration of the loop. The br.ctop instruction takes the branch if either the LC is nonzero or the EC is greater than one, and terminates when the LC equals 0 and the EC equals 1. An example is the instruction:
br.ctop.sptk.few start_loop
Hints are supplied to the processor that the branch is normally taken, few instructions should be prefetched, and these instructions should be cached. The br.cexit instruction operates in a similar way. An example is
br.cexit.sptk.few start_loop
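It differs from the br.ctop instruction in how the decision whether or not to branch is made. In this case, the branch is not taken and the loop continues if either the LC is nonzero or the EC is greater than 1. The branch is taken, and the loop is exited, when the LC equals 0 and the EC equals 1.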
For a while loop, the qualifying predicate that results from a comparison is used as the condition for determining whether or not the branch is to take place. The .wtop and .wexit branch types are provided for implementing pipelined while loops. For br.wtop, the branch is taken and the loop continues until the qualifying predicate is 0 and the value in the EC is 1. For br.wexit, the sense of the branch is reversed: the branch is not taken and the loop continues if either the qualifying predicate is 1 or the EC is greater than 1, and the branch is taken to exit the loop otherwise.
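The four modulo-scheduled branch types share one "keep looping" test and differ only in whether that test takes the branch or falls through. The Python sketch below models only that decision. The function names are hypothetical, the treatment of an EC of 0 as "stop" is an assumption, and the register rotation and counter updates that accompany the branch are deliberately left out.

```python
def counted_loop_continues(lc, ec):
    """Counted loops keep going while LC is nonzero or the epilog is not finished."""
    return lc != 0 or ec > 1


def while_loop_continues(qp, ec):
    """While loops keep going while the qualifying predicate is 1 or EC is above 1."""
    return qp == 1 or ec > 1


def branch_taken(branch_type, continues):
    """Map the continue decision onto taken / not taken for each branch type."""
    if branch_type in ("ctop", "wtop"):
        return continues          # branch back to the top of the loop
    if branch_type in ("cexit", "wexit"):
        return not continues      # branch out of the loop
    raise ValueError("unknown branch type: " + branch_type)


if __name__ == "__main__":
    for lc, ec in [(9, 4), (0, 4), (0, 1)]:
        cont = counted_loop_continues(lc, ec)
        print(lc, ec, branch_taken("ctop", cont), branch_taken("cexit", cont))
```

Seen this way, .ctop and .cexit (and likewise .wtop and .wexit) describe the same loop condition. The choice between them depends only on whether the branch sits at the bottom of the loop or elsewhere in it.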
To replace a nonpipelined loop sequence with a pipelined implementation, the compiler must analyze the structure and execution of the routine. Let us use the loop routine in Figure 7.16 (a) as an example to demonstrate the process used to analyze the program.

L1:   ld4 r4 = [r5],4;;        //Cycle 0   load post-inc 4
      add r7 = r4,r9;;         //Cycle 2   add constant A
      st4 [r6] = r7,4          //Cycle 3   store post-inc 4
      br.cloop L1;;            //Cycle 3   repeat or done
(a) Typical counted loop routine

(p16) ld4 r4 = [r5],4          //stage 1
(p17) ---                      //stage 2-empty stage
(p18) add r7 = r4,r9           //stage 3
(p19) st4 [r6] = r7,4          //stage 4
(b) Restructured as stages

      mov ar.lc = 9            //LC = loop count - 1
      mov ar.ec = 4            //EC = epilog stages + 1
      mov pr.rot = 1<<16;;     //PR16 = 1, rest = 0
L1:
(p16) ld4 r32 = [r5],4         //Cycle 0
(p18) add r35 = r34,r9         //Cycle 0
(p19) st4 [r6] = r36,4         //Cycle 0
      br.ctop L1;;             //Cycle 0
(c) Implemented as pipelined loop

Figure 7.16: Three Treatments of the Counted Loop

This program employs a loop operation to repeatedly perform the calculation
Y[i] = X[i] + A
In this equation, X[i] and Y[i] are variables and correspond to values of data from tables X and Y, respectively, in memory, and A is a constant. The program reads a double word value of X from the X table, adds it to the value of constant A, and saves the result in the corresponding location in the Y table. As part of each memory access, the addresses that point to storage locations of the elements of data in the X and Y tables are post-incremented by 4 so that they point to the next value in the table.
The first step in the analysis process is to arrange the instructions of the body of the loop routine in stages based on the clock cycle in which they are executed. Notice in Figure 7.16 (a) that because dependencies exist between the general registers of the operands, there is very little parallel execution. The load, add, and store operations are all performed in different clock cycles and take two cycles, one cycle, and one cycle, respectively. Even though the store instruction takes two cycles to complete, this operation is considered effectively one cycle because the repeat of the loop does not depend on this result existing in memory.
Based on the clock cycle information for the individual instructions, the body of the loop is partitioned into the stages shown in Figure 7.16 (b). Notice that an extra stage is added to account for the second clock cycle needed to complete the load operation. Predication also plays an important role in loop pipelining. A qualifying predicate is assigned to the instructions in each stage to control activation of the stage. These predicates must be rotating predicates; for this reason, they must be in the range from Pr16 through Pr63.
You must first understand how the instructions in a pipelined loop are executed before being able to recode a nonpipelined loop for pipelined operation. Table 7.13 shows the order in which the instructions of our example routine are executed, and the execution units that are active during each clock cycle. Remember that the load instruction takes two clock cycles to complete. For this reason, the first add operation cannot be performed until clock cycle 3. Also, an add must take place before the first store operation. This is why the first store takes place in clock cycle 4. In this way, you see that the first iteration of the loop still takes four clock cycles; however, the second, third, and fourth iterations of the loop are in progress in parallel. Notice that the second store to memory, which signifies the completion of the second iteration of the loop operation, occurs in clock cycle 5. If this loop operation were performed in a nonpipelined way, eight clock cycles would be required to complete these two iterations.

Table 7.13: Phases of Execution

         Execution unit/instructions
Cycle    M      I      M      B          Phase
1        ld4    -      -      br.ctop    Prolog
2        ld4    -      -      br.ctop    Prolog
3        ld4    add    -      br.ctop    Prolog
4        ld4    add    st4    br.ctop    Kernel
5        ld4    add    st4    br.ctop    Kernel
6        ld4    add    st4    br.ctop    Kernel
7        ld4    add    st4    br.ctop    Kernel
8        ld4    add    st4    br.ctop    Kernel
9        ld4    add    st4    br.ctop    Kernel
10       ld4    add    st4    br.ctop    Kernel
11       -      add    st4    br.ctop    Epilog
12       -      add    st4    br.ctop    Epilog
13       -      -      st4    br.ctop    Epilog
The example in Table 7.13 represents a modulo-scheduled pipelined loop. This kind of loop operation has three execution phases known as the prolog, kernel, and epilog. The cycles associated with each of the phases are identified in the table. Notice that the prolog phase corresponds to clock cycles 1 through 3 and represents the filling of the pipeline. During the kernel phase, which is represented by clock cycles 4 through 10, a new loop iteration is started and another is finished in each clock cycle. Since the pipeline is full, each of the stages of execution (load, add, store, and branch) is performed. The epilog phase represents the draining of the pipeline. During clock cycles 11 through 13, no new iterations are started; instead, the previous iterations are finished. All 10 iterations of the loop are completed in 13 clock cycles versus the 40 needed by a nonpipelined implementation. This large reduction in clock cycles demonstrates the positive impact on application performance of pipelined loops.
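The 13-versus-40 comparison follows from simple arithmetic, sketched here in Python. The function names are hypothetical, and the pipelined formula assumes, as in this example, that a new iteration starts every clock cycle.

```python
def nonpipelined_cycles(iterations, cycles_per_iteration):
    """Each iteration runs to completion before the next one starts."""
    return iterations * cycles_per_iteration


def pipelined_cycles(iterations, stages):
    """One iteration starts per cycle; the last one needs (stages - 1) more cycles to drain."""
    return iterations + stages - 1


if __name__ == "__main__":
    print(nonpipelined_cycles(10, 4))   # 40
    print(pipelined_cycles(10, 4))      # 13
```

The benefit grows with the iteration count: the pipelined cost approaches one cycle per iteration, while the nonpipelined cost stays at four.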
Our explanation of the instruction execution sequence illustrated in Table 7.13 has left some questions unanswered. First, you observe that not all instructions are executed during each clock cycle. How is this accomplished? The qualifying predicates that were assigned to the instructions in Figure 7.16 (b) are used to control when the instructions execute. Another question that you may ask is "How can a second data load take place before the add is performed that uses the element of data from the first load? Why is the first element of data not lost?" The answer to this question is that general register rotation is performed by pipelined counted loop instructions. Therefore, the data from the two loads are not put into the same register and the two add operations that follow do not add data from the same registers. In fact, both the general registers and predicate registers are automatically rotated by the branch instruction in pipelined loop operations.
The remaining question that needs to be answered is "How are the rotating general registers and predicate registers used and initialized for the pipelined loop?" To understand this, we must look more closely at how a pipelined loop is executed.
Table 7.14 is a more detailed summary of the execution of our example of a pipelined counted loop. Here the values of the predicate registers, loop count, and epilog count are included for each of the clock cycles.

Table 7.14: Pipelined Execution of the Loop Program

         Execution unit/instructions      State before br.ctop
Cycle    M      I      M      B           p16   p17   p18   p19   LC   EC
1        ld4    -      -      br.ctop     1     0     0     0     9    4
2        ld4    -      -      br.ctop     1     1     0     0     8    4
3        ld4    add    -      br.ctop     1     1     1     0     7    4
4        ld4    add    st4    br.ctop     1     1     1     1     6    4
5        ld4    add    st4    br.ctop     1     1     1     1     5    4
6        ld4    add    st4    br.ctop     1     1     1     1     4    4
7        ld4    add    st4    br.ctop     1     1     1     1     3    4
8        ld4    add    st4    br.ctop     1     1     1     1     2    4
9        ld4    add    st4    br.ctop     1     1     1     1     1    4
10       ld4    add    st4    br.ctop     1     1     1     1     0    4
11       -      add    st4    br.ctop     0     1     1     1     0    3
12       -      add    st4    br.ctop     0     0     1     1     0    2
13       -      -      st4    br.ctop     0     0     0     1     0    1
14       -      -      -      -           0     0     0     0     0    0
The values that exist during clock cycle 1 are the initial values and must be loaded before beginning execution of the loop. Notice that predicate register Pr16 is set to 1, while Pr17 through Pr19 are cleared. The loop count must also be initialized. Since the loop in the example is to repeat 10 times, the LC is set to 9. Finally, EC equals the number of pipelined iterations it takes to complete one iteration of the nonpipelined loop routine. Notice in Table 7.14 that the first iteration of the loop completes in the fourth pipelined iteration; therefore, the initial value of EC is 4.
Remember that Pr16, Pr18, and Pr19 qualify the load, add, and store instructions, respectively. Since only Pr16 is 1 during the first execution of the loop, Table 7.14 shows that the only operation performed is a load. Execution of the br.ctop instruction that completes the first loop causes the value of the LC to decrement to 8, rotates the predicate registers by renaming, and loads the value 1 into the new Pr16. Due to the predicate register rotation, the original value in Pr16 is now in Pr17. Moreover, rotation of the general registers causes the value loaded into a general register to be shifted into the next-higher-numbered register. This frees up the register corresponding to the original load operation for entry of a new value of data.
As shown in the table, both Pr16 and Pr17 are 1 during clock cycle 2. Again, only the load instruction is qualified for execution and the value of data is loaded into the register associated with the load instruction. As the branch instruction executes, the LC decrements to 7, the predicate registers and general registers are again rotated upward one position, and Pr16 is reloaded with 1.
During clock cycle 3, Pr16, Pr17, and Pr18 are all 1. For this reason, both the load and add instructions are qualified for execution. During this loop, a third element of data is loaded and the first add takes place. The value added to the constant entered one register during the first clock cycle, but now, due to rotation, resides in the register numbered 2 higher. Recognizing that loaded data shifts two register positions before getting added is important. Because of this two-position register rotation, the register number associated with the variable source operand in the addition instruction in the pipelined version of the counted loop routine must be 2 higher than that of the destination operand in the load instruction. At the end of this loop, the registers rotate one position higher and Pr16 is again loaded with 1. After this rotation, the result produced by the add instruction is located in the general register numbered 1 higher than the add destination (r36 rather than r35 in Figure 7.16 (c)).
Looking at Table 7.14, we see that during clock cycle 4, Pr16 through Pr19 are all 1. Therefore, the load, add, and store operations are all qualified to occur. This signals that the prolog phase is complete and the kernel phase is beginning. Remember that the result of the add operation that gets stored in memory during this cycle has rotated one register position higher than the destination register in which it was saved in the prior cycle. In this way, we find another important characteristic for implementing the pipelined loop. The source register of the store instruction must be numbered 1 higher than that of the destination in the add instruction. The kernel mode of operation repeats through clock cycle 10. Each loop causes a load, add, store, register rotate, decrement of the LC, and reload of 1 into Pr16.
During clock cycle 10, the LC is 0. This means that all 10 values have been loaded. When the branch instruction executes at the end of this loop, the general registers and predicates are again rotated by one position, but this time Pr16 is loaded with 0 instead of 1 and, as Table 7.14 shows, the EC decrements for the first time, from 4 to 3. This event marks the transition from the kernel phase to the epilog phase.
As clock cycle 11 is executed, the load instruction is no longer qualified and a load operation is not performed. Notice that during clock cycle 11, add and store operations are performed, registers are rotated, Pr16 is loaded with 0, and the EC decrements again. This continues until the pipeline is drained.
From the description of the pipelined execution of this loop program, we have discovered the information needed to recode the nonpipelined counted loop program as a pipelined loop:
1. General registers that hold variables must be relocated to the rotating registers. Any of the registers in the range from Gr32 through Grn can be used, where n = 31 + the number of rotating registers. That number is set through the sor (size of rotating) value provided in the alloc instruction and is always a multiple of 8.
2. The execution stages of the pipelined instruction sequence must be assigned to qualifying predicates. The starting predicate register must be Pr16.
3. Either the br.ctop or br.cexit instruction must be used to terminate the loop.
4. The loop count used by the branch instruction must be 1 less than the number of loop iterations and the LC register must be initialized with this value prior to the start of the loop sequence.
5. The EC register must be initialized with a value equal to the number of pipelined loop iterations needed to complete one iteration of the nonpipelined loop.
6. Pr16 must be initialized to 1 and the higher numbered predicate registers used in the pipelined loop must be cleared prior to starting the loop sequence.
7. The number of the register used for the source operand of the variable in the add instruction must be 2 higher than that used for the destination in the load instruction, and the source register in the store instruction must be numbered 1 higher than the destination register in the add instruction.
The compiler uses this information to create a pipelined solution for the counted loop routine. Figure 7.16 (c) shows the pipelined version of the example program.
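To see the rules above working together, the pipelined loop of Figure 7.16 (c) can be stepped through in software. The Python sketch below is a simplified model, not an Itanium emulator. The RotatingFile class and run_pipelined_loop function are hypothetical names, a rotating region of eight general registers is assumed, memory is replaced by Python lists, and operand widths are ignored.

```python
class RotatingFile:
    """A register file whose register names rotate by one position per loop branch."""

    def __init__(self, first, size):
        self.first = first          # lowest rotating register number
        self.size = size            # number of rotating registers
        self.rrb = 0                # rotating register base
        self.cells = [0] * size

    def _slot(self, number):
        return (number - self.first + self.rrb) % self.size

    def __getitem__(self, number):
        return self.cells[self._slot(number)]

    def __setitem__(self, number, value):
        self.cells[self._slot(number)] = value

    def rotate(self):
        # After rotation, a value written as register n is read as register n + 1.
        self.rrb = (self.rrb - 1) % self.size


def run_pipelined_loop(x, a):
    """Model Figure 7.16 (c) for a non-empty list x; returns (y, per-cycle state)."""
    gr = RotatingFile(32, 8)        # r32..r39 (assumed size of the rotating region)
    pr = RotatingFile(16, 48)       # p16..p63
    lc, ec = len(x) - 1, 4          # mov ar.lc = count - 1, mov ar.ec = 4
    pr[16] = 1                      # mov pr.rot = 1<<16
    src, y, trace = 0, [], []
    while True:
        # State before br.ctop, as tabulated in Table 7.14.
        trace.append((pr[16], pr[17], pr[18], pr[19], lc, ec))
        do_load, do_add, do_store = pr[16], pr[18], pr[19]
        r34, r36 = gr[34], gr[36]   # sources are read before any register is written
        if do_load:                 # (p16) ld4 r32 = [r5],4
            gr[32] = x[src]
            src += 1
        if do_add:                  # (p18) add r35 = r34,r9
            gr[35] = r34 + a
        if do_store:                # (p19) st4 [r6] = r36,4
            y.append(r36)
        # br.ctop L1
        if lc != 0:
            lc, new_p16, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, 0, True
        else:
            ec, new_p16, taken = ec - 1, 0, False
        gr.rotate()
        pr.rotate()
        pr[16] = new_p16
        if not taken:
            return y, trace


if __name__ == "__main__":
    y, trace = run_pipelined_loop(list(range(10)), 100)
    print(y)
    for cycle, state in enumerate(trace, start=1):
        print(cycle, state)
```

The trace lists one tuple per clock cycle in the order p16, p17, p18, p19, LC, EC, so it can be compared line by line with the rows of Table 7.14.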
Chapter 8 - Multimedia Instructions Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Chapter 8: Multimedia Instructions

The part of the instruction set covered in Chapters 5 through 7 applies to a wide variety of software applications. This chapter covers a segment of the instruction set, multimedia instructions, that is targeted at a special application area, processing of multimedia data. Examples of important applications that require processing of multimedia data are graphics and some audio processing applications. The Itanium processor architecture includes multimedia instructions that perform parallel data processing. Addition of these special instructions to the ISA results in higher performance for multimedia applications. The multimedia instructions implemented in the Itanium architecture are semantically compatible with Intel® MMX™ technology instructions and streaming SIMD extensions instruction technology.
MULTIMEDIA DATA STRUCTURES

Multimedia applications, such as audio and video, involve the processing of large arrays of independent elements of data. Figure 8.1(a) shows how multimedia data is organized in memory. These elements of data could represent digital samples of an audio signal. The source and destination operands processed by the special multimedia instructions are always 64 bits wide. Notice in Figure 8.1(a) that this quad word can be organized as eight contiguous 8-bit-wide elements (8x8), four contiguous 16-bit-wide elements (4x16), or two contiguous 32-bit-wide elements (2x32).
Figure 8.1: Little-endian Ordered Multimedia Parallel Data in Memory and Registers

Looking at the individual 8x8 elements, we see that they are identified as element 000, which is located at the lowest byte address, through 111, which is held at the highest byte address. The 4x16 data represents a data structure with four 16-bit elements of data labeled elements 00 through 11. Data structured in this way is known as parallel data.

Just like integer data processing instructions, the source and destination operands of multimedia instructions must reside in general registers. Transfers of multimedia data between memory and the general registers are not performed with multimedia instructions. Like other types of data, they are loaded or saved with memory access instructions. Figure 8.1(b) shows how each of the multimedia data structures is loaded into general registers of the Itanium processor for processing. For instance, the four word-wide data elements 00 through 11 would be loaded into general register Gr2 with element 00 in the least-significant word position and element 11 in the most-significant word position.

A multimedia instruction recognizes each element of data in a parallel data structure as an independent piece of multimedia data, but processes them in parallel. Some multimedia instructions do not process all of the elements of data in the source operands. For instance, the result might be formed only from the odd or even elements of two source operands. In the 8x8 structured element in register Gr1, the elements 001, 011, 101, and 111 are considered the odd elements and 000, 010, 100, and 110 the even elements. They are denoted in an instruction as the left (.l) and right (.r) elements in a register, respectively. Another way the elements of data in a source operand are partitioned for processing is as the higher (.h) or lower (.l) elements. In this case, elements 100, 101, 110, and 111 are considered the higher elements and 000, 001, 010, and 011 are treated as the lower elements.
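The element numbering just described can be made concrete with a short Python sketch. The helper names split_elements, odd_even, and higher_lower are hypothetical and exist only for this illustration; element 0 is taken from the least-significant end of the 64-bit value, matching the register layout described above.

```python
def split_elements(value, width):
    """Split a 64-bit register value into elements of the given bit width, element 0 first."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(64 // width)]


def odd_even(elements):
    """Return (odd, even) elements, the groups selected as left (.l) and right (.r)."""
    return elements[1::2], elements[0::2]


def higher_lower(elements):
    """Return (higher, lower) halves, the groups selected as .h and .l."""
    half = len(elements) // 2
    return elements[half:], elements[:half]


if __name__ == "__main__":
    bytes8 = split_elements(0x0706050403020100, 8)
    print(bytes8)                 # element 000 first, element 111 last
    print(odd_even(bytes8))
    print(higher_lower(bytes8))
    print([hex(e) for e in split_elements(0x0003000200010000, 16)])
```

Each byte of the test value holds its own element number, so the printed lists show directly which elements fall into the odd, even, higher, and lower groups.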
MULTIMEDIA INSTRUCTION GROUP

Earlier we indicated that the Itanium architecture supports multimedia instructions that are compatible with Intel's MMX technology instructions and streaming SIMD extensions instruction technology. The multimedia instruction group includes three classes of instructions: arithmetic, shift, and data arrangement. For instance, arithmetic instructions are provided to add, subtract, and multiply multimedia data. Most of the multimedia instructions are defined to process 8x8, 4x16, and 2x32 multimedia data.

SIMD stands for "single instruction multiple data." The general registers are considered to hold a set of aligned independent elements of data. When a multimedia instruction is executed, the processor performs that operation independently on each of the multiple elements in the specified general registers in parallel. For example, if an add operation is performed on source registers that hold 2x32 data, two independent computations are performed, one to add the lower double words (element 0) in the source registers and another to add the upper double words (element 1) in these registers. Two separate aligned results are produced in the destination register. For this reason, the multimedia instructions are said to perform parallel data processing.
PARALLEL ARITHMETIC INSTRUCTIONS

The first instructions from the multimedia instruction group that we examine in this chapter are those that perform parallel arithmetic operations. Support for a wide variety of parallel arithmetic operations is provided in the Itanium processor's instruction set. These parallel operations include standard arithmetic functions such as add, subtract, multiply, and compare, as well as special parallel functions including average, minimum, and maximum. Graphics and some audio processing applications are typically arithmetic intensive.
Parallel Addition Instructions

Four different parallel arithmetic instructions involve the addition of multimedia data. These instructions are the parallel add (padd) instruction, parallel average (pavg) instruction, parallel shift left and add (pshladd) instruction, and parallel shift right and add (pshradd) instruction. Table 8.1 identifies the three formats of the padd instruction.

Table 8.1: Parallel Arithmetic Instruction Formats

Mnemonic   Operation                             Format
padd       Parallel add                          (qp) padd1 r1=r2,r3
                                                 (qp) padd2 r1=r2,r3
                                                 (qp) padd4 r1=r2,r3
pavg       Parallel average                      (qp) pavg1 r1=r2,r3
                                                 (qp) pavg2 r1=r2,r3
pshladd    Parallel shift left and add           (qp) pshladd2 r1=r2,count2,r3
pshradd    Parallel shift right and add          (qp) pshradd2 r1=r2,count2,r3
psub       Parallel subtract                     (qp) psub1 r1=r2,r3
                                                 (qp) psub2 r1=r2,r3
                                                 (qp) psub4 r1=r2,r3
pavgsub    Parallel average subtract             (qp) pavgsub1 r1=r2,r3
                                                 (qp) pavgsub2 r1=r2,r3
pcmp       Parallel compare                      (qp) pcmp1.prel r1=r2,r3
                                                 (qp) pcmp2.prel r1=r2,r3
                                                 (qp) pcmp4.prel r1=r2,r3
pmpy       Parallel multiply                     (qp) pmpy2 r1=r2,r3
pmpyshr    Parallel multiply and shift right     (qp) pmpyshr2 r1=r2,r3,count2
psad       Parallel sum of absolute difference   (qp) psad1 r1=r2,r3
pmin       Parallel minimum                      (qp) pmin1.u r1=r2,r3
                                                 (qp) pmin2 r1=r2,r3
pmax       Parallel maximum                      (qp) pmax1.u r1=r2,r3
                                                 (qp) pmax2 r1=r2,r3
The add operations they perform are the same, except for the size of the data elements they process. For example, padd1 adds a set of eight 8-bit parallel data elements and padd4 adds a set of two 32-bit parallel data elements. The general format of the word parallel add instruction is:
(qp) padd2 r1=r2,r3
When executed, the four aligned words of data in the source registers, identified by r2 and r3, are independently added and the four sums that are produced are put into the destination register specified by r1. That is, element 0 in r2 is added to element 0 in r3 and the sum placed in the element 0 position of r1, element 1 in r2 is added to element 1 in r3 and this sum is placed in the element 1 position of r1, and so on. This add operation is illustrated in Figure 8.2.

Several different types of add operations can be performed with this instruction by appending a completer to the parallel add mnemonic. Table 8.2 shows the permitted completers. Notice that if no completer is provided, a modulo add is performed. On the other hand, the other three completers specify what are called saturated add operations.
For instance, the signed saturation add (.sss) completer identifies that the values of the set of multimedia data elements in the source and destination registers are all signed integer numbers. The padd4 form of the instruction supports only modulo add operations.
Figure 8.2: 4x16 Parallel Add Operation

Table 8.2: Saturation Completers

Completer   Meaning               Result r1   Source r2   Source r3   Instructions
None        Modulo                -           -           -           padd1, padd2, padd4, psub1, psub2, psub4, pack2, pack4
sss         Signed saturation     Signed      Signed      Signed      padd1, padd2, psub1, psub2, pack2, pack4
uus         Unsigned saturation   Unsigned    Unsigned    Signed      padd1, padd2, psub1, psub2
uuu         Unsigned saturation   Unsigned    Unsigned    Unsigned    padd1, padd2, psub1, psub2
uss         Unsigned saturation   Unsigned    Signed      Signed      pack2
Let us next look at the difference between the modulo and saturation add operations. These two types of add operations only differ in how they handle a result that does not fit within the specified size of the data element. For instance, addition of two unsigned 16-bit numbers could result in a 17-bit result. For the saturation forms of the parallel add operation, a result that is larger than the largest value that can be represented with the number of bits in the data element is made equal to the largest value that can be coded with this size element. Similarly, a result that is smaller than the smallest value is made equal to the smallest value that can be coded within the specified number of bits. This operation is known as saturation clipping. The upper and lower limits for a specific size of multimedia element are called the saturation limits.

The saturation limits of 8-bit and 16-bit data elements are shown in Table 8.3. For instance, this table shows that an upper limit of FFFFH and lower limit of 0000H bound the range of sums that can be produced for the unsigned number addition performed by a padd2.uuu instruction. If the result of adding two aligned elements of the source operands is larger than 16 bits, the result placed in the destination register is clamped at the upper limit FFFFH. Similarly, if the sum produced by adding two negative numbers with the instruction padd2.sss is more negative than the value represented by 8000H, the result placed in the specified destination register is clamped at 8000H. The second type of unsigned saturation addition instruction, padd.uus, treats one source operand as an unsigned number, the other source operand as a signed number, and clamps the result to the limits of the unsigned number range.

Table 8.3: Parallel Saturation Limits

Size   Source width   Result width   Saturation   Result upper limit   Result lower limit   Instructions
1      8 bit          8 bit          Signed       0x7F                 0x80                 padd1, psub1
1      8 bit          8 bit          Unsigned     0xFF                 0x00                 padd1, psub1
2      16 bit         16 bit         Signed       0x7FFF               0x8000               padd2, psub2
2      16 bit         16 bit         Unsigned     0xFFFF               0x0000               padd2, psub2
2      16 bit         8 bit          Signed       0x7F                 0x80                 pack2
2      16 bit         8 bit          Unsigned     0xFF                 0x00                 pack2
4      32 bit         16 bit         Signed       0x7FFF               0x8000               pack4
The modulo form of the instruction does not clamp the result of an addition exceeding the largest or smallest value; instead, it wraps the value around in the range of the destination element.

Example
If the elements of multimedia data in Gr5 and Gr6 are EFFFFFFF00000000H and 1100001100000001H, respectively, what results are produced in Gr4 by executing the following instructions?
(p0) padd1.uuu r4=r5,r6
(p0) padd1 r4=r5,r6

Solution
The first instruction treats the operands as unsigned numbers and performs eight 8-bit additions. Starting from the uppermost data element, they are:
EFH + 11H = 100H
FFH + 00H = FFH
FFH + 00H = FFH
FFH + 11H = 110H
00H + 00H = 00H
00H + 00H = 00H
00H + 00H = 00H
00H + 01H = 01H
The results that exceed the 8-bit size of the data element are replaced by the upper saturation limit of FFH. The results produced in the destination register are
Gr4 = FFFFFFFF00000001H
The modulo version of the instruction performs the same add operations; however, different results are produced for the two additions that exceed the upper limit of 8-bit numbers. This time the values wrap around to produce the result
Gr4 = 00FFFF1000000001H
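The two results in this example can be reproduced with a small software model of the parallel add. This Python sketch is an illustration only. The function padd and its completer argument are hypothetical stand-ins for the padd1, padd2, and padd4 mnemonics, and the signedness of each operand follows Table 8.2.

```python
def _lanes(value, width):
    """Split a 64-bit value into elements of the given width, element 0 first."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(64 // width)]


def _signed(x, width):
    """Reinterpret an unsigned element as a two's-complement signed number."""
    return x - (1 << width) if x >> (width - 1) else x


def padd(r2, r3, width, completer=None):
    """Element-wise add: modulo when completer is None, else 'sss', 'uus', or 'uuu'."""
    mask = (1 << width) - 1
    result = 0
    for i, (a, b) in enumerate(zip(_lanes(r2, width), _lanes(r3, width))):
        if completer is None:
            s = (a + b) & mask                       # wrap around
        elif completer == "sss":
            s = _signed(a, width) + _signed(b, width)
            s = max(-(1 << (width - 1)), min((1 << (width - 1)) - 1, s)) & mask
        elif completer == "uuu":
            s = min(a + b, mask)                     # clamp at the upper limit
        elif completer == "uus":
            s = max(0, min(mask, a + _signed(b, width)))
        else:
            raise ValueError("unknown completer: " + completer)
        result |= s << (i * width)
    return result


if __name__ == "__main__":
    gr5, gr6 = 0xEFFFFFFF00000000, 0x1100001100000001
    print("%016X" % padd(gr5, gr6, 8, "uuu"))   # FFFFFFFF00000001
    print("%016X" % padd(gr5, gr6, 8))          # 00FFFF1000000001
```

Only the two elements whose sums exceed FFH differ between the saturating and modulo results; every other element is identical.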
The parallel average (pavg) instruction calculates the average of two unsigned multimedia data elements. This instruction supports just 8x8 and 4x16 data structures and they are selected by using the mnemonics pavg1 and pavg2, respectively. For instance, the following instruction format is used to form the average for each of the four aligned 16-bit elements of data in the source registers specified by r2 and r3 and place them in the destination register specified by r1. (qp) pavg2 r1=r2,r3
To prevent the occurrence of a round-off error, rounding must be performed as part of the average calculation operation. A completer can be added to the mnemonic of the parallel average instruction to specify the type of rounding employed by the average calculation algorithms. Table 8.4 shows that specifying no completer selects a normal average calculation and appending the .raz completer selects a method known as round-away-from-zero.

Table 8.4: Parallel Average Completers

Completer   Meaning
None        Normal
raz         Round-away-from-zero
Let us next look more closely at how the normal and round-away-from-zero average calculations are performed. The normal average calculation performed by the pavg2 instruction is displayed in Figure 8.3(a). Notice that each of the four 16-bit elements of data in the source register specified by r2 is added to the corresponding element in the register specified by r3. Then, each of the resulting four sums is separately divided by 2 by shifting its bits right by one bit position. The carry bit shifts into the most significant bit of each 16-bit average. This division is the source of the need for rounding. Figure 8.3(a) shows how normal rounding is accomplished. The logical OR of the two least significant bits of the sum is put in the least significant bit position of the average. Therefore, if either of these two bits in the sum is 1, the least significant bit of the average in r1 is rounded up to 1.
Figure 8.3: 4x16 Parallel Average Operations with Rounding

The equivalent round-away-from-zero average calculation is demonstrated in Figure 8.3(b). Here rounding is accomplished by simply adding 1 to each sum before performing the right shift.

The parallel shift left and add instruction (pshladd) performs a left shift on the elements of the first source, then adds them to the corresponding elements from the second source. Signed saturation is performed on both the shift and the add operations. The parallel shift right and add instruction (pshradd) is similar to pshladd. Both of these instructions are defined for 2-byte elements only.

Example
The 8x8 elements of multimedia data in registers Gr5 and Gr6 are EFFFFFFF000000F0H and 1200001100000014H, respectively. What results are produced in Gr4 for the most-significant element by executing the following instructions?
(p0) pavg1 r4=r5,r6
(p0) pavg1.raz r4=r5,r6

Solution
Executing the first instruction performs the average calculation as follows:
EFH + 12H = 101H = 100000001₂
Note that the least-significant bit in the sum is logic 1. Shifting right by one bit position to perform the divide by 2, with rounding making the least-significant bit of the average 1, gives:
10000001₂ = 81H
Now, executing the round-away-from-zero instruction gives this result:
EFH + 12H + 01H = 102H = 100000010₂
Shifting right by one bit position gives:
10000001₂ = 81H
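The two rounding methods can be compared side by side with a short model. In this Python sketch the function pavg and its raz flag are hypothetical stand-ins for the pavg1 and pavg2 mnemonics and the .raz completer. The normal form ORs the two least-significant bits of the sum into the low bit of the average, as described for Figure 8.3(a), and the round-away-from-zero form adds 1 before shifting.

```python
def pavg(r2, r3, width, raz=False):
    """Element-wise unsigned average of two 64-bit values with the chosen rounding."""
    mask = (1 << width) - 1
    result = 0
    for i in range(64 // width):
        a = (r2 >> (i * width)) & mask
        b = (r3 >> (i * width)) & mask
        if raz:
            avg = (a + b + 1) >> 1          # add 1, then divide by 2
        else:
            total = a + b
            avg = (total >> 1) | (total & 1)  # OR the shifted-out bit into the low bit
        result |= (avg & mask) << (i * width)
    return result


if __name__ == "__main__":
    gr5, gr6 = 0xEFFFFFFF000000F0, 0x1200001100000014
    print("%02X" % (pavg(gr5, gr6, 8) >> 56))            # most-significant element
    print("%02X" % (pavg(gr5, gr6, 8, raz=True) >> 56))
    print(pavg(1, 2, 8), pavg(1, 2, 8, raz=True))        # a case where the methods differ
```

For the example operands both methods give 81H in the most-significant element. The last line shows that they can disagree: averaging 1 and 2 gives 1 with normal rounding and 2 with round-away-from-zero.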
Instructions are also provided in the instruction set that permit the multimedia data in one of the source registers to be shifted to the left or right before being added to the other source operand. The instructions that perform these add operations are parallel shift left and add (pshladd) and parallel shift right and add (pshradd). Table 8.1 gives the general format of these instructions. Notice that the multimedia data processed by both instructions must be arranged in the 4x16 structure. When the instruction is executed, each of the four 16-bit elements of data in the source general register identified by r2 is independently shifted left or right by the number of bit positions specified by immediate operand count2. With this 2-bit immediate operand field, the shift operation is limited to 1, 2, or 3 bit positions. During shift left operations, bit locations that are emptied on the least-significant bit end of an element of data are filled with 0s. On the other hand, the most-significant bit end locations that are vacated during a shift right operation are filled with the value of the sign bit in the original number. After the shift is complete, the four shifted values are independently added to the corresponding elements of data in the source register identified by r3. The resulting four sums are placed in the destination general register specified by r1.
The multimedia data elements processed by the shift and add instructions are signed 16-bit integer numbers. Since the source and destination operands are signed integers, saturation limits are applied to the results of both the shift and add operations. Therefore, if the value produced by either of these operations cannot be represented as a signed 16-bit value, the result is replaced by the appropriate limit. The upper and lower limits for 16-bit signed numbers are given in Table 8.3 as 7FFFH and 8000H, respectively.
Example
If the 4x16 elements of multimedia data in Gr5 and Gr6 are 1FFFFFFF00000000H and 1100001100000001H, respectively, what result is produced in Gr4 for the most-significant element of data when the instructions that follow are executed?
(p0) pshladd2 r4=r5,2H,r6
(p0) pshradd2 r4=r5,2H,r6
Solution
Executing the first instruction causes the original value of the most-significant data element in Gr5 (shown below) to be shifted left by two bit positions with the emptied LSBs filled with 0s.
1FFFH = 0001111111111111₂
The result is:
0111111111111100₂ = 7FFCH
Now, adding this value to the most-significant element in Gr6, which is 1100H, and applying signed saturation limits produces the result:
7FFCH + 1100H = 90FCH, which saturates to 7FFFH
Executing the shift right instruction shifts the original value in Gr5 right by two bit positions and fills the vacated MSB locations with the original sign bit, which is 0, as follows:
0000011111111111₂ = 07FFH
Adding to the most-significant value in Gr6 produces:
07FFH + 1100H = 18FFH
This result does not exceed the 16-bit signed saturation limit.
Parallel Subtraction Instructions
Two kinds of parallel subtraction operations are supported with instructions in the instruction set. These instructions, which are identified by the mnemonics psub and pavgsub, perform a parallel subtraction and a parallel average subtraction calculation, respectively, for the individual elements of a multimedia data set. These instructions and their operation are similar to the parallel add and parallel average instructions we just described; however, they subtract the elements in the operands instead of adding them. The formats of the parallel subtract (psub) instruction are shown in Table 8.1. Comparing these formats to those of the padd instruction, we find that they are the same. The subtraction operation performed by the psub1 form of the instruction is illustrated in Figure 8.4. Execution of this instruction subtracts individual elements of the 8x8 set of data elements in the register specified by r3 from the corresponding elements in the register specified by r2. The eight separate differences that are produced are placed in the register identified by destination operand r1. The type completers in Table 8.2 apply to the operands in the psub instruction and the saturation limits in Table 8.3 apply to the difference results produced by the subtraction.
Figure 8.5: 4x16 Parallel Average Subtraction Operation
The parallel average subtract (pavgsub) instruction processes a set of unsigned integer data elements. The pavgsub formats in Table 8.1 are the same as those employed by the pavg instruction. Moreover, the operation it performs is similar to the normal parallel average calculation described earlier. The difference is that the corresponding elements in the source registers are independently subtracted, rather than added, before the shift right takes place to form the average. The most significant bit of the result, which is vacated during the shift operation, is filled with the value of the borrow that results from the subtraction. The steps in this average calculation operation are outlined in Figure 8.5. Unlike the pavg instruction, pavgsub only supports the normal method of rounding for the least-significant bit of the result.
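As a rough illustration of the steps just listed (subtract, shift right with the borrow entering the top bit, then OR the two low bits of the difference into the least-significant bit), here is a minimal Python sketch. It models only the prose description; the helper names are assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def pavgsub(r2, r3, bits=16):
    """Model of the average-subtract steps on unsigned elements (hypothetical)."""
    wide = (1 << (bits + 1)) - 1  # keep one extra bit: it holds the borrow
    out = []
    for a, b in zip(split(r2, bits), split(r3, bits)):
        d = (a - b) & wide
        out.append((d >> 1) | (d & 1))  # borrow shifts into the MSB; LSB is ORed
    return join(out, bits)


print(f"{pavgsub(0x0010, 0x0004):04X}H")  # (16 - 4) / 2
print(f"{pavgsub(0x0004, 0x0010):04X}H")  # (4 - 16) / 2, borrow fills the MSB
```

The first call prints 0006H. The second prints FFFAH, which reads as –6 when the 16-bit pattern is interpreted as a signed number.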
Figure 8.4: 8x8 Parallel Subtraction Operation
Parallel Compare Instruction
In Chapter 7 we found the integer compare instruction is used to compare two integer numbers and set or reset predicate registers depending on the result of the comparison operation. Similar to the compare instruction, the parallel compare (pcmp) instruction compares the values of elements of data, but the results produced by these comparisons are not recorded in predicate registers. Table 8.1 contains the general formats of the parallel compare instruction. Notice that it can process 8x8, 4x16, or 2x32 multimedia data operands. The comparison operation that is performed on these operands is selected by appending the appropriate parallel relationship (.prel) completer to the instruction mnemonic. Table 8.5 shows that only two comparison operations can be performed, equal (.eq) or greater than (.gt). The operation of the parallel compare instruction that follows is demonstrated in Figure 8.6.
(p0) pcmp2.eq r1=r2,r3
Table 8.5: Parallel Comparison Relationships
prel    Comparison relationship
eq      r2 == r3
gt      r2 > r3 (signed)
Figure 8.6: 4x16 Parallel Comparison Operation
When this instruction is executed, the four individual 16-bit elements of data in the general register specified by r3 are independently compared to their corresponding elements in the general register identified by r2. If the values of two corresponding elements are equal, the bits associated with this element of data in the destination register specified by r1 are all set to 1. On the other hand, if they are not equal, the bits corresponding to the element in the result are cleared to 0. The results shown in destination register r1 in Figure 8.6 illustrate a parallel comparison operation in which the first, second, and fourth 16-bit elements in r2 and r3 are found to be equal, while the third elements in these two registers are not equal.
Example
If the values in Gr2 and Gr3 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr1 when the instruction that follows is executed?
(p0) pcmp1.gt r1=r2,r3
Solution
When the parallel compare instruction performs the greater-than relationship, the individual data elements in the source operands are interpreted as signed integer numbers. The eight comparisons performed with this instruction are:
Most-significant element:   00H > FFH = 0 > –1 = True
                            FFH > 00H = –1 > 0 = False
                            7FH > 80H = +127 > –128 = True
                            FFH > FEH = –1 > –2 = True
                            80H > 7FH = –128 > +127 = False
                            80H > 7FH = –128 > +127 = False
                            80H > 7FH = –128 > +127 = False
Least-significant element:  7FH > 0FH = +127 > +15 = True
Figure 8.7 shows the results that are produced in destination register Gr1 by this parallel comparison operation.
Figure 8.7: Example of an 8x8 Parallel Comparison
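The element-wise masks produced by the parallel compare can be reproduced with a small Python model. It is a sketch of the behavior described above, with illustrative helper names, and it treats .gt as a signed comparison as Table 8.5 states.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pcmp(r2, r3, bits, rel):
    """Model of pcmp1/pcmp2/pcmp4 with the .eq or .gt relationship."""
    ones = (1 << bits) - 1
    out = []
    for a, b in zip(split(r2, bits), split(r3, bits)):
        if rel == "eq":
            hit = a == b
        elif rel == "gt":
            hit = to_signed(a, bits) > to_signed(b, bits)
        else:
            raise ValueError("rel must be 'eq' or 'gt'")
        out.append(ones if hit else 0)  # all 1s when true, all 0s when false
    return join(out, bits)


gr2 = 0x00FF7FFF8080807F
gr3 = 0xFF0080FE7F7F7F0F
print(f"{pcmp(gr2, gr3, 8, 'gt'):016X}H")
```

For the example operands this prints FF00FFFF000000FFH: each byte is FFH where the comparison listed above is True and 00H where it is False.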
Parallel Multiplication Instructions
In the discussion of integer arithmetic instructions in Chapter 6, we found that multiplication of integer numbers is performed by the floating-point multiply instructions. The floating-point instructions do not have the ability to perform parallel multiplications of multimedia data. For this reason, special parallel multiply instructions are included in the instruction set. The two instructions that perform parallel multiplication operations are parallel multiply (pmpy) and parallel multiply and shift right (pmpyshr). The general format of the pmpy instruction is given in Table 8.1 as
(qp) pmpy2 r1 = r2,r3
Notice that it only supports multiplication of 4x16 structured elements of multimedia data. Figure 8.8(a) and (b) illustrate that either a parallel left multiplication or parallel right multiplication operation can be performed. Table 8.6 shows that the completers used to select a left or right multiplication are .l or .r, respectively. For example, the following instruction performs two independent, 16-bit by 16-bit multiplications, using the corresponding left elements in source operand registers Gr2 and Gr3.
(p0) pmpy2.l r1=r2,r3
Looking at Figure 8.8(a), we see that the left elements are the second (element 01) and fourth (element 11) signed 16-bit elements of data in the source registers. Remember that multiplying two 16-bit numbers can give a product containing up to 32 bits. For this reason, the 32-bit product that results from multiplying the 01 elements in the source registers is written to the lower (0 element position) 32 bits of destination register Gr1. The product produced from multiplying the corresponding fourth elements is placed in the higher 32 bits (1 element position) of Gr1.
Figure 8.8: Right and Left Parallel Multiplication Operations
Table 8.6: Multiply Completers
Completer   Meaning
l           Left
r           Right
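To make the element selection concrete, the sketch below models pmpy2 in Python: the .l form multiplies the 01 and 11 elements, the .r form multiplies the 00 and 10 elements, and each signed product occupies 32 bits of the result. The function and helper names are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pmpy2(r2, r3, side):
    """Model of pmpy2.l (side='l') and pmpy2.r (side='r')."""
    x = [to_signed(e, 16) for e in split(r2, 16)]
    y = [to_signed(e, 16) for e in split(r3, 16)]
    lo, hi = (1, 3) if side == "l" else (0, 2)
    # join() masks each product to 32 bits, giving two's complement if negative
    return join([x[lo] * y[lo], x[hi] * y[hi]], 32)


r2 = 0x0004000300020001  # elements 11..00 hold 4, 3, 2, 1
r3 = 0x0008000700060005  # elements 11..00 hold 8, 7, 6, 5
print(f"{pmpy2(r2, r3, 'l'):016X}H")  # 4*8 in the upper half, 2*6 in the lower
print(f"{pmpy2(r2, r3, 'r'):016X}H")  # 3*7 in the upper half, 1*5 in the lower
```

With these made-up operands the left form gives 000000200000000CH and the right form gives 0000001500000005H.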
The parallel multiply and shift right instruction performs a different type of multiplication operation. The operation of this instruction is demonstrated in Figure 8.9. This diagram shows that four independent 16-bit by 16-bit multiplications are performed using the corresponding elements in source registers Gr2 and Gr3. The value of the data elements is considered to be a signed integer unless the instruction mnemonic is appended with a .u completer. The four 32-bit products that result are not directly used to update the destination register Gr1. Instead, they are first shifted right by a specified number of bit positions, then truncated to 16 bits.
Figure 8.9: Parallel Multiplication and Shift Right Operation
Looking at the general format of the instruction in Table 8.1, we see that a 2-bit count (count2) is specified as an immediate operand. The value of this count determines how many bit positions each independent 32-bit product is shifted to the right. Table 8.7 shows the allowed values for count2, resulting number of bit positions shifted, and range of bits written to Gr1 as the result.
Table 8.7: Parallel Multiplication and Shift Right Count Options
count2   Bit positions shifted   Selected bit field from each 32-bit product
0        0                       15:0
1        7                       22:7
2        15                      30:15
3        16                      31:16
For instance, the following instruction independently shifts the values of the four 32-bit elements right by 7 bit positions. (p0) pmpyshr2 r1=r2,r3,7 Therefore, four 16-bit values formed from bits 7 through 22 of each product are saved in the destination register.
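The multiply, shift, and truncate sequence can be sketched in Python as follows. The model takes the shift amount directly (0, 7, 15, or 16), as the instruction example above does; the names are illustrative assumptions, and the unsigned flag stands in for the .u completer.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pmpyshr2(r2, r3, shift, unsigned=False):
    """Model of pmpyshr2: four 16x16 products, shifted right, cut to 16 bits."""
    if shift not in (0, 7, 15, 16):
        raise ValueError("shift must be 0, 7, 15, or 16")
    out = []
    for a, b in zip(split(r2, 16), split(r3, 16)):
        if not unsigned:
            a, b = to_signed(a, 16), to_signed(b, 16)
        product = (a * b) & 0xFFFFFFFF           # 32-bit product
        out.append((product >> shift) & 0xFFFF)  # select the 16-bit field
    return join(out, 16)


print(f"{pmpyshr2(0x0100, 0x0100, 7):04X}H")   # bits 22:7 of 00010000H
print(f"{pmpyshr2(0x4000, 0x4000, 15):04X}H")  # bits 30:15 of 10000000H
```

These two calls print 0200H and 2000H, the fields that Table 8.7 identifies for shifts of 7 and 15 bit positions.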
Parallel Sum of Absolute Difference Instruction
The parallel sum of absolute difference (psad) instruction performs another important parallel software subtraction operation needed in multimedia applications. The format of the instruction is given in Table 8.1 and the operation that it performs is demonstrated in Figure 8.10. This diagram shows that the first operation performed is a parallel subtraction. Each of the eight independent 8-bit data elements in source register Gr3 is subtracted from its corresponding data element in Gr2. The parallel data elements in the source registers are interpreted as unsigned numbers. The next step in the process is to obtain the absolute value for each of the eight independent differences. Finally, the individual absolute values are added together to produce the result in Gr1. Notice that, as a result, a 16-bit data element is produced, and it is placed in the 00-element location of Gr1. The other three elements in Gr1 are filled with zeros.
Figure 8.10: Parallel Sum of Absolute Difference Operation
Example
If the values in Gr5 and Gr6 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr10 when the instruction that follows is executed?
(p0) psad1 r10=r5,r6
Solution
When the instruction is executed, the eight unsigned elements of data in Gr6 are subtracted from the corresponding elements in Gr5 and then the absolute value is found for each of the eight differences. This gives:
Most-significant element:   |00H – FFH| = |0–255| = |–255| = 255
                            |FFH – 00H| = |255–0| = |255| = 255
                            |7FH – 80H| = |127–128| = |–1| = 1
                            |FFH – FEH| = |255–254| = |1| = 1
                            |80H – 7FH| = |128–127| = |1| = 1
                            |80H – 7FH| = |128–127| = |1| = 1
                            |80H – 7FH| = |128–127| = |1| = 1
Least-significant element:  |7FH – 0FH| = |127–15| = |112| = 112
Now, adding individual absolute values gives the result in Gr10. Gr10 = 255 + 255 + 1 + 1 + 1 + 1 + 1 + 112 = 627 = 273H
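The same calculation can be written as a one-line reduction in Python. This is a sketch of the operation as described, with illustrative helper names.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def psad1(r2, r3):
    """Sum of absolute differences of eight unsigned bytes.

    The total is at most 8 * 255, so it fits in the low 16-bit element
    and the upper elements of the 64-bit result stay zero.
    """
    return sum(abs(a - b) for a, b in zip(split(r2, 8), split(r3, 8)))


gr5 = 0x00FF7FFF8080807F
gr6 = 0xFF0080FE7F7F7F0F
result = psad1(gr5, gr6)
print(result, f"{result:016X}H")
```

For the example operands the model prints 627 and 0000000000000273H.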
Parallel Minimum and Maximum Instructions
The last two parallel arithmetic instructions are parallel minimum (pmin) and parallel maximum (pmax). These instructions are used to independently compare the values of corresponding elements in two sets of multimedia data and produce a result from either the smaller values or larger values, respectively. This kind of operation is important for applications dealing with graphics and audio/video data manipulations. The formats of the pmin and pmax instructions are shown in Table 8.1. Notice that the first form of each instruction (pmin1.u/pmax1.u) processes parallel data structured as eight 8-bit unsigned integers. The other form treats the values of the operands as four 16-bit signed integers. Figure 8.11 demonstrates the processing performed by the pmin2 instruction. Notice that first a signed comparison is made between each of the individual 16-bit elements of data in Gr2 and its corresponding element of data in Gr3. Then, the value of the smaller of the two numbers from each of the four pairs of parallel data is placed in destination register Gr1. The operation of pmax2, which is shown in Figure 8.12, is similar except that the result is formed from the greater of the two numbers in each pair.
Figure 8.11: 4x16 Parallel Minimum Operation
Figure 8.12: 4x16 Parallel Maximum Operation
Example
If the values in Gr5 and Gr6 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr10 when the following instructions are executed?
(p0) pmax1.u r10=r5,r6
(p0) pmin1.u r10=r5,r6
Solution
When the pmax1.u instruction is executed, the eight unsigned elements of data in Gr5 are compared with the corresponding elements in Gr6. This gives the results:
Most-significant element:   00H > FFH = 0 > 255 = False = FFH
                            FFH > 00H = 255 > 0 = True = FFH
                            7FH > 80H = 127 > 128 = False = 80H
                            FFH > FEH = 255 > 254 = True = FFH
                            80H > 7FH = 128 > 127 = True = 80H
                            80H > 7FH = 128 > 127 = True = 80H
                            80H > 7FH = 128 > 127 = True = 80H
Least-significant element:  7FH > 0FH = 127 > 15 = True = 7FH
Using the larger number to form the result in Gr10, we get Gr10 = FFFF80FF8080807FH The result produced by executing the pmin1.u instruction is simply formed by using the other number in each calculation for each element of data, to produce: Gr10 = 00007FFE7F7F7F0FH
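Both results can be reproduced with a brief Python model. The sketch covers the unsigned 8-bit forms used in the example and adds a signed 16-bit minimum for comparison; all names are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pmax1_u(r2, r3):
    """Eight unsigned byte maximums."""
    return join([max(a, b) for a, b in zip(split(r2, 8), split(r3, 8))], 8)


def pmin1_u(r2, r3):
    """Eight unsigned byte minimums."""
    return join([min(a, b) for a, b in zip(split(r2, 8), split(r3, 8))], 8)


def pmin2(r2, r3):
    """Four signed 16-bit minimums; the winning bit pattern is copied as is."""
    pairs = zip(split(r2, 16), split(r3, 16))
    return join([min(a, b, key=lambda v: to_signed(v, 16)) for a, b in pairs], 16)


gr5 = 0x00FF7FFF8080807F
gr6 = 0xFF0080FE7F7F7F0F
print(f"{pmax1_u(gr5, gr6):016X}H")
print(f"{pmin1_u(gr5, gr6):016X}H")
```

The two lines print FFFF80FF8080807FH and 00007FFE7F7F7F0FH. Note that the signed form orders values differently: pmin2 applied to 8000H and 7FFFH selects 8000H, because 8000H is –32768 as a signed number.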
PARALLEL SHIFT INSTRUCTIONS
The parallel shift instructions are similar to the integer shift instructions introduced in Chapter 6. They are also used to shift the bits of a source operand to the left or right by a specified number of bit positions. However, the parallel shift instructions perform the shift left or shift right operation on the values of each of the independent elements of multimedia data in the source register.
Table 8.8: Parallel Shift Instruction Formats
Mnemonic   Operation              Format
pshl       Parallel shift left    (qp) pshl2 r1=r2,r3
                                  (qp) pshl2 r1=r2,count5
                                  (qp) pshl4 r1=r2,r3
                                  (qp) pshl4 r1=r2,count5
pshr       Parallel shift right   (qp) pshr2 r1=r3,r2
                                  (qp) pshr2 r1=r3,count5
                                  (qp) pshr4 r1=r3,r2
                                  (qp) pshr4 r1=r3,count5
The parallel shift left (pshl) instruction independently shifts the bits in each element of the source operand left by the number of bit positions specified with a count. Table 8.8 gives the two general formats of this instruction. The r2 argument specifies the general register of the source operand that contains the values whose bits are to be shifted. Notice that the instruction can process multimedia data that are structured 4x16 or 2x32. The individual values of data are treated as unsigned integer numbers. After the shift takes place, the resulting new value for each of the elements of data is placed in the destination general register specified by r1. The two formats of the pshl instruction differ in the way that the shift count is specified. In the first format, the count is specified by a value that has been loaded into the general register identified by r3. The other format has the count specified by immediate operand count5. In both cases, the count is interpreted as an unsigned number and can range from 1 through 15 for shifts of 4x16 structured data and from 1 through 31 for 2x32 data.
As the shift takes place, the vacated bits at the least-significant end of each element of data are filled with 0s. If a shift count above either of these ranges is specified, the result produced for each element of data in r1 is 0. The operation of a pshl2 instruction is illustrated in Figure 8.13.
Figure 8.13: 4x16 Parallel Shift Left Operation
The formats of the parallel shift right (pshr) instruction in Table 8.8 parallel those described for the shift left instruction. However, the shift right operation can be performed for either signed or unsigned integer numbers. If the instruction is written with a .u completer attached to the mnemonic, the values of the data elements in the source and destination registers are interpreted as unsigned numbers. This operation is known as a logical shift. If no completer is appended, the numbers are assumed to be signed integers and an arithmetic shift operation is performed. Again, the count can be specified in a source register or as an immediate operand, and the ranges given for the shift left instruction apply. When an arithmetic shift takes place, the elements of data in the source register specified by r3 are each independently shifted to the right a number of bit positions equal to either count5 or the count in the general register identified by r2. The vacated most-significant bit positions of each element are filled with the value of the sign bit in the original value of that data element. The logical shift right operation is identical except that in this case vacated higher-order bits are filled with 0s.
Example
If the value in Gr5 is 00FF7FFF8080807FH, what results are produced in Gr10 when the instructions that follow are executed?
(p0) pshl4 r10=r5,4H
(p0) pshr4 r10=r5,4H
Solution
Execution of the pshl4 instruction causes the values of the two 32-bit unsigned integer elements of data in Gr5 to be independently shifted left by four bit positions. The vacated lower-order bit positions are filled with 0s. This gives:
00FF7FFFH → 0FF7FFF0H
8080807FH → 080807F0H
Therefore, the result in the destination register is
Gr10 = 0FF7FFF0080807F0H
The second instruction treats the two elements of data in Gr5 as signed 32-bit numbers, shifts them four bit positions to the right, and fills the vacated higher-order bits with the values of their original sign bit.
00FF7FFFH → 000FF7FFH
8080807FH → F8080807H
This gives the result:
Gr10 = 000FF7FFF8080807H
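The 2x32 shifts worked through in this solution can be modeled in Python as shown below. The sketch takes the element width as a parameter so the same functions cover 4x16 data; the names and the unsigned flag (standing in for the .u completer) are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pshl(r2, count, bits):
    """Parallel shift left; bits shifted past the element boundary are lost."""
    return join([e << count for e in split(r2, bits)], bits)


def pshr(r3, count, bits, unsigned=False):
    """Parallel shift right: arithmetic by default, logical when unsigned."""
    out = []
    for e in split(r3, bits):
        value = e if unsigned else to_signed(e, bits)
        out.append(value >> count)  # >> on a negative int replicates the sign
    return join(out, bits)


gr5 = 0x00FF7FFF8080807F
print(f"{pshl(gr5, 4, 32):016X}H")
print(f"{pshr(gr5, 4, 32):016X}H")
print(f"{pshr(gr5, 4, 32, unsigned=True):016X}H")
```

The first two lines print 0FF7FFF0080807F0H and 000FF7FFF8080807H. The third shows the logical variant, 000FF7FF08080807H, where the vacated bits of the lower element are filled with 0s instead of the sign bit.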
PARALLEL DATA ARRANGEMENT INSTRUCTIONS The Itanium processor’s instruction set includes an instruction that can rearrange the order of the individual elements of multimedia data within a single register. This operation is known as a multiplex and is performed by the multiplex (mux) instruction. The mux instruction is one of four parallel data arrangement instructions in the multimedia instruction group. The other three instructions that perform data arrangement functions are called mix (mix), pack (pack), and unpack (unpack). They differ from mux in that they implement different methods of merging elements of data from two source registers into a destination register.
Multiplex Instruction
The mux instruction is used to rearrange the organization of the data elements within a register. The general formats of this instruction are given in Table 8.9. Notice that the original set of multimedia data elements is located in the general register specified by r2. This set of data can be structured as either eight 8-bit elements or four 16-bit elements. After processing by the mux operation, the result is placed in the destination register defined by r1.
Table 8.9: Data Arrangement Instruction Formats
Mnemonic   Operation   Format
mux        Multiplex   (qp) mux1 r1=r2,mbtype4
                       (qp) mux2 r1=r2,mbtype8
mix        Mix         (qp) mix1 r1=r2,r3
                       (qp) mix2 r1=r2,r3
                       (qp) mix4 r1=r2,r3
pack       Pack        (qp) pack2 r1=r2,r3
                       (qp) pack4 r1=r2,r3
unpack     Unpack      (qp) unpack1 r1=r2,r3
                       (qp) unpack2 r1=r2,r3
                       (qp) unpack4 r1=r2,r3
Using the mux instruction, the elements of data in the source operand register can be rearranged in a number of different ways. The mux permutation type (mbtype) source operand selects the new data arrangement. As shown in Table 8.9, the mux permutation type operand of the mux1 instruction format has just four bits. Therefore, mux1 is limited to five different permutations, which are identified in Table 8.10 as reverse, mix, shuffle, alternate, and broadcast. For example, specifying @rev as the mux permutation type reverses the order of the elements of data in source operand r2 and places this result in destination register r1.
Table 8.10: mux Permutations for 8-Bit Elements
mbtype4   Function
@rev      Reverse the order of the bytes
@mix      Perform a mix operation on the two halves of r2
@shuf     Perform a shuffle operation on the two halves of r2
@alt      Perform an alternate operation on the two halves of r2
@brcst    Perform a broadcast operation on the least-significant byte of r2
The operation performed by the following instruction is demonstrated in Figure 8.14(a). (p0) mux1 r1 = r2,@rev Notice that the 000 element of the source operand is placed into the 111 element position in the destination operand, while the 111 element of the source operand ends up in position 000 of the destination. As another example, the broadcast operation is illustrated in Figure 8.14(b). This result is produced by executing the following instruction: (p0) mux1 r1 = r2,@brcst
Figure 8.14: 8x8 mux Operations
Notice that the least-significant element of data (element 000) in source r2 is copied to all eight data locations in destination r1. The data rearrangement operations performed by the mix (@mix), shuffle (@shuf), and alternate (@alt) types on 8x8 structured elements of data are shown in Figure 8.14(c), (d), and (e), respectively.
The mux2 instruction format supports an eight-bit mux permutation type operand, mbtype8. This larger operand permits all possible permutations of the four 16-bit elements to be coded into the instruction. In this case, the new order of the elements is simply coded into an 8-bit number, written in hexadecimal, that is used as the operand. Figure 8.15(a) demonstrates the coding needed to reverse the order of the elements. The positions of the four elements in the original set of data are coded with the indices 11, 10, 01, and 00. Index 00 represents the least-significant element position and 11 the most-significant element position. To rearrange the order of the elements, a mux permutation type is formed by simply listing the indices for the elements in the order they are to appear in the new arrangement. The 8-bit binary number that results is converted to a 2-digit hexadecimal number that is used as mbtype8. For instance, to reverse the order of the elements, mbtype8 is formed as follows:
00 01 10 11 = 00011011₂ = 1BH
Figure 8.15: 4x16 mux2 Operations
The instruction that performs this operation is written as:
(p0) mux2 r1 = r2,1BH
Figure 8.15(b) shows another example. Here, a broadcast of the third element (10 element) is performed by coding the mux permutation operand with the following value:
10 10 10 10 = 10101010₂ = AAH
The instruction is written as:
(p0) mux2 r1 = r2,0AAH
Example
If the value in Gr5 is 00FF7FFF8080807FH, what results are produced in Gr10 when the instruction that follows is executed?
(p0) mux2 r10 = r5,0D8H
Solution
The original data elements are arranged as follows:
00 = 807FH
01 = 8080H
10 = 7FFFH
11 = 00FFH
Decoding the mux permutation type operand gives:
mbtype8 = D8H = 11011000₂ = 11 01 10 00
Therefore, the results of the mux operation are:
Gr10 = 00FF80807FFF807FH
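The mbtype8 decoding lends itself to a compact Python model: each 2-bit field of the operand names the source element that lands in the matching destination position. The sketch below also includes the two mux1 permutations illustrated earlier, reverse and broadcast; the function names are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def mux2(r2, mbtype8):
    """Bits 2i+1:2i of mbtype8 select the source element for position i."""
    src = split(r2, 16)
    return join([src[(mbtype8 >> (2 * i)) & 3] for i in range(4)], 16)


def mux1_rev(r2):
    """Model of mux1 with @rev: reverse the order of the eight bytes."""
    return join(split(r2, 8)[::-1], 8)


def mux1_brcst(r2):
    """Model of mux1 with @brcst: copy the least-significant byte everywhere."""
    return join([r2 & 0xFF] * 8, 8)


gr5 = 0x00FF7FFF8080807F
print(f"{mux2(gr5, 0xD8):016X}H")   # the worked example
print(f"{mux2(gr5, 0x1B):016X}H")   # reverse the four elements
print(f"{mux2(gr5, 0xAA):016X}H")   # broadcast element 10
print(f"{mux1_rev(gr5):016X}H")
print(f"{mux1_brcst(gr5):016X}H")
```

The first line prints 00FF80807FFF807FH, agreeing with the solution above. The reverse and broadcast codes give 807F80807FFF00FFH and 7FFF7FFF7FFF7FFFH.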
Mix Instruction
The data arrangement operation performed by the mix instruction selects elements of parallel data from two source registers and merges them into a destination register. Looking at the formats of the mix instruction in Table 8.9, we see that instructions mix1, mix2, and mix4 are provided to process 8x8, 4x16, or 2x32 structured parallel data, respectively. As shown in Figure 8.16, two different mix operations can be performed by the mix2 instruction. The upper mix function is known as a mix left and is selected by appending the instruction with the left (.l) completer. Executing the mix left (mix.l) instruction interleaves the values of the odd-numbered elements (elements 11 and 01) of source operands r2 and r3 to produce the result in destination r1. The mix right (mix.r) instruction performs the same function, but uses the even-numbered elements 10 and 00.
Figure 8.16: 4x16 mix Operations
Example
If the values in Gr5 and Gr6 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr10 when the instruction that follows is executed?
(p0) mix1.r r10=r5,r6
Solution
When the instruction is executed, the four even-numbered 8-bit elements of data from each of Gr5 and Gr6 are mixed together to form the result in Gr10. This operation is illustrated in general in Figure 8.17.
The even-numbered elements of Gr5 are:
Element 110 = FFH
Element 100 = FFH
Element 010 = 80H
Element 000 = 7FH
The even-numbered elements of Gr6 are:
Element 110 = 00H
Element 100 = FEH
Element 010 = 7FH
Element 000 = 0FH
Combining these elements, the result in Gr10 equals:
Gr10 = FF00FFFE807F7F0FH
Figure 8.17: Operation of the mix1.r Instruction
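The interleaving can be expressed as a short Python model. In each output pair the element from the first source takes the higher position and the element from the second source the lower one, which is the ordering the worked example implies. The names are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def mix(r2, r3, bits, side):
    """Model of mix1/mix2/mix4 with the .l (odd) or .r (even) completer."""
    x, y = split(r2, bits), split(r3, bits)
    start = 1 if side == "l" else 0
    out = []
    for i in range(start, len(x), 2):
        out += [y[i], x[i]]  # r3's element sits below r2's in each pair
    return join(out, bits)


gr5 = 0x00FF7FFF8080807F
gr6 = 0xFF0080FE7F7F7F0F
print(f"{mix(gr5, gr6, 8, 'r'):016X}H")
```

For mix1.r on the example operands the model prints FF00FFFE807F7F0FH.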
Pack Instruction
The pack instruction also performs a data arrangement operation by selecting elements of parallel data from two source registers and merging them together in the destination register. Table 8.9 shows that there are two forms of the instruction, pack2 and pack4. The pack2 format processes signed source operands but can produce either a signed or unsigned result. The type of result produced is determined by a saturation completer appended to the instruction mnemonic. The saturation completers supported by the pack instruction are given in Table 8.2. Notice that a signed result is produced with the instruction pack2.sss and an unsigned result is obtained by the instruction pack2.uss. The pack4 format only supports a signed result. The operation of the pack2 instruction is illustrated in Figure 8.18. Notice that it reduces the eight 16-bit signed numbers in source operands Gr2 and Gr3 to 8-bit signed or unsigned numbers and packs them in destination operand Gr1. When a 16-bit number in a source register fits in eight bits, its eight least significant bits are extracted and written to the destination register. If the value of a source element cannot be represented in the number of bits provided for it in the result, saturation is applied. Table 8.3 lists the upper and lower saturation limits for the pack instruction. For instance, if the value of one of the elements in a source operand of a pack2.sss instruction is 02FFH, it is too large to be represented in eight bits. Therefore, the value placed in the destination operand is 7FH, which equals the signed upper saturation limit. If an element is larger than the upper limit value, the result is the upper limit value; if it is smaller than the lower limit value, the result is the lower limit value.
Figure 8.18: 4x16 Pack Operation Example If the values in Gr5 and Gr6 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr10 when the instruction that follows is executed? (p0) pack4.sss r10=r5,r6 Solution When the instruction is executed, the four 32-bit signed numbers of data from Gr5 and Gr6 are individually converted to 16-bit numbers, then packed together to form the result in Gr10. This operation is illustrated in general in Figure 8.19. The 16-bit elements of Gr5 are formed as follows: Element 00 = 8080807FH = 8000H Element 01 = 00FF7FFFH = 7FFFH The 16-bit elements of Gr6 are formed as follows: Element 00 = 7F7F7F0FH = 7FFFH Element 01 = FF0080FEH = 8000H Packing these elements into Gr10 gives: Gr10 = 80007FFF7FFF8000H
Figure 8.19: Operation of the pack4.sss Instruction
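A Python sketch of the pack operation follows. It clamps each signed source element to the range of the half-width result and places the elements of the first source in the lower half of the destination, which is the arrangement shown in the worked example. The function names and the unsigned_result flag (standing in for the .uss completer) are illustrative assumptions.

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def to_signed(x, bits):
    """Interpret a bits-wide field as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def pack(r2, r3, src_bits, unsigned_result=False):
    """Model of pack2 (src_bits=16) and pack4 (src_bits=32) with saturation."""
    dst_bits = src_bits // 2
    if unsigned_result:
        low, high = 0, (1 << dst_bits) - 1
    else:
        low, high = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    elems = split(r2, src_bits) + split(r3, src_bits)  # r2 fills the lower half
    clamped = [max(low, min(high, to_signed(e, src_bits))) for e in elems]
    return join(clamped, dst_bits)


gr5 = 0x00FF7FFF8080807F
gr6 = 0xFF0080FE7F7F7F0F
print(f"{pack(gr5, gr6, 32):016X}H")           # pack4.sss
print(f"{pack(0x02FF, 0, 16) & 0xFF:02X}H")    # 02FFH saturates to 7FH
```

The pack4.sss line prints 80007FFF7FFF8000H, and the second line prints 7FH for the 02FFH case discussed in the text.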
Unpack Instruction
The last data arrangement instruction in the instruction set is the unpack instruction. As shown in Table 8.9, unpack1, unpack2, and unpack4 instruction formats are provided to process 8x8, 4x16, or 2x32 structured parallel data. Moreover, each instruction can be appended with the completer high (.h) or low (.l) to define which data from source operands r2 and r3 are unpacked into the destination register r1. These three instructions perform the same unpack operation on their associated-size parallel data. The operations of the unpack2.h and unpack2.l instructions are illustrated in Figure 8.20. Notice that the unpack high operation merges the upper two data elements of the source operands by interleaving them into the destination register. The unpack low operation performs the same operation for the lower two elements of the source operands.
Figure 8.20: 4x16 Unpack Operations Example If the values in Gr5 and Gr6 are 00FF7FFF8080807FH and FF0080FE7F7F7F0FH, respectively, what results are produced in Gr10 when the instruction that follows is executed? (p0) unpack4.l r10=r5,r6 Solution When the instruction is executed, the lower 32-bit data from Gr5 and Gr6 are unpacked together to form the result in Gr10. This operation is illustrated in general in Figure 8.21. The lower 32-bit element of Gr5 is Element 00 = 8080807FH The lower 32-bit element of Gr6 is Element 00 = 7F7F7F0FH
Unpacking these elements into Gr10 gives: Gr10 = 8080807F7F7F7F0FH
Figure 8.21: Operation of the unpack4.l Instruction
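The unpack operation differs from mix only in which elements are chosen: the upper half or the lower half of each source rather than the odd or even positions. A minimal Python sketch, with illustrative names, is:

```python
def split(reg, bits):
    """Split a 64-bit value into elements, least-significant element first."""
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]


def join(elems, bits):
    """Reassemble elements (least-significant first) into a 64-bit value."""
    mask = (1 << bits) - 1
    out = 0
    for i, e in enumerate(elems):
        out |= (e & mask) << (i * bits)
    return out


def unpack(r2, r3, bits, half):
    """Model of unpack1/unpack2/unpack4 with the .h or .l completer."""
    x, y = split(r2, bits), split(r3, bits)
    n = len(x) // 2
    chosen = range(n, 2 * n) if half == "h" else range(n)
    out = []
    for i in chosen:
        out += [y[i], x[i]]  # r3's element sits below r2's in each pair
    return join(out, bits)


gr5 = 0x00FF7FFF8080807F
gr6 = 0xFF0080FE7F7F7F0F
print(f"{unpack(gr5, gr6, 32, 'l'):016X}H")  # unpack4.l
print(f"{unpack(gr5, gr6, 32, 'h'):016X}H")  # unpack4.h
```

The unpack4.l line prints 8080807F7F7F7F0FH, matching the example; the .h form on the same operands gives 00FF7FFFFF0080FEH.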
Chapter 9: Floating-point Architecture
The Itanium processor’s floating-point hardware includes four independent floating-point execution units (two extended-precision units and two single-precision units) and a large, dedicated file of registers. These resources deliver world-class performance: peak performance of 3 gigaflops for extended precision or 6 gigaflops for single precision. The Itanium processor’s floating-point architecture is fully compliant with the ANSI/IEEE Standard for Binary Floating-point Arithmetic (Std. 754-1985). To further enhance the performance of floating-point applications, the Itanium architecture includes extensions such as a combined multiply and add instruction, minimum and maximum calculation instructions, and instructions that enable high-bandwidth memory accesses. This chapter examines the floating-point data types, the register resources, and the operations that are performed by the floating-point instructions.
FLOATING-POINT DATA
Floating-point numbers are called real numbers because they can represent signed integers, fractions, or mixed numbers. Floating-point notation represents numbers written in scientific notation. A floating-point number contains three elements: a sign, a biased exponent, and a significand. The Itanium architecture supports a variety of numeric formats. These variations include IEEE-compatible single-precision real numbers, double-precision real numbers in hardware, and quad precision through software. In addition to the IEEE real number data types, the architecture directly supports signed and unsigned 64-bit integers, and an 82-bit floating-point register format. Instructions can be combined to implement 128-bit floating-point arithmetic operations efficiently, but that is beyond the scope of this book.
IEEE Real Numbers
These floating-point number types differ in the ranges and coding allowed for the biased exponent and the number of bits of precision for the significand. Table 9.1 lists the parameters that define each of the IEEE real number types. They are the sign bit, the maximum and minimum size of the unbiased exponent, the exponent bias, and the precision. Notice that the IEEE double-extended real number is characterized as being positive or negative, having an unbiased exponent that ranges from a maximum of +16383 to a minimum of -16382, a bias value of +16383, and a significand expressed with 64 bits of precision.

Table 9.1: IEEE Real-data Type Ranges

| IEEE real-type parameters | Single | Double | Double-extended | Quad |
| --- | --- | --- | --- | --- |
| Sign | + or - | + or - | + or - | + or - |
| Emax | +127 | +1023 | +16383 | +16383 |
| Emin | -126 | -1022 | -16382 | -16382 |
| Exponent bias | +127 | +1023 | +16383 | +16383 |
| Precision (bits) | 24 | 53 | 64 | 113 |

Example
What is the range of biased exponents that can be used to express an IEEE single-precision floating-point number?
Solution
The maximum value of the biased exponent is found by adding the exponent bias, +127 (7FH), to the Emax value of the exponent. This gives:
Emax biased = +127 + 127 = 254 = 0FEH
The minimum value of the biased exponent is found in the same way from the Emin value:
Emin biased = -126 + 127 = 1 = 1H
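The same calculation applies to every column of Table 9.1. The following Python sketch simply repeats the addition for each type; the dictionary and function names are hypothetical and chosen for this illustration only.

```python
# (Emax, Emin, exponent bias) for each type, copied from Table 9.1
IEEE_TYPES = {
    "single": (127, -126, 127),
    "double": (1023, -1022, 1023),
    "double-extended": (16383, -16382, 16383),
    "quad": (16383, -16382, 16383),
}

def biased_range(kind: str) -> tuple:
    """Return (minimum, maximum) biased exponent for a normalized number."""
    emax, emin, bias = IEEE_TYPES[kind]
    return emin + bias, emax + bias

for kind in IEEE_TYPES:
    low, high = biased_range(kind)
    print(f"{kind:16s} {low:X}H through {high:X}H")
```

For the single-precision case this reproduces the range 1H through FEH found in the example.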
As an example, let us convert the number +125.375 to IEEE single-precision floating-point notation. The first step is the sign bit. Since the number is positive, the sign bit is 0.
Sign = 0
Next, the decimal number is converted to binary form and expressed in scientific notation, as follows:
125.375 = 1111101.011 (binary) = 1.111101011 x 2^6
The form of the binary number that results in the scientific notation is known as the significand. Relative to Itanium architecture, the significand is partitioned into the integer piece called the explicit integer, which is the 1 to the left of the point. The other part, all bits to the right of the point, represents the fractional part of the significand. Table 9.1 shows that the precision for IEEE single precision is 24 bits. This means that single-precision real numbers are always coded with 24 bits. When expressed with scientific notation, the integer part of any non-zero normalized real number is 1. Therefore, it takes up just one bit of the precision, leaving 23 bits for coding the fraction. If the fractional piece can be coded in fewer than 23 bits, the remaining bits are filled with 0s. On the other hand, if the fractional part of the number extends beyond this 23-bit limit, it must be rounded to fit into the allowed range. In this way, we see that the significand for our example is:
Significand = 1.11110101100000000000000
Rounding can lead to a growing error in large, repetitive floating-point calculations. Sometimes the solution to rounding error is simply to express the data with more precision. That is, instead of using single-precision floating-point numbers, express the data used in the computations as double-precision or extended double-precision real numbers.
To express data as a single-precision number, Table 9.1 says that the exponent must be biased by +127 (7FH). The biased exponent is obtained by adding 7FH to the original value of the exponent. For our example, the exponent is 6 decimal (6H) and gives the following results:
6H + 7FH = 85H
Biased exponent = 10000101
Let us next look at how the earlier number would be represented as a double-precision floating-point number. Table 9.1 shows that the only differences are that the exponent bias is +1023 (3FFH) and that the precision is expanded to 53 bits.
Re-biasing the exponent for double precision, we get:
6H + 3FFH = 405H
Biased exponent = 10000000101
Increasing the precision of the significand gives this result:
1.1111010110000000000000000000000000000000000000000000
Example
Determine what decimal number is coded as:
Sign bit = 0
Biased exponent = 86H
Significand = 1.11110101100000000000000
Solution
Extracting the exponent from the biased exponent, we get:
Exponent = 86H - 7FH = 7H
And combining the exponent and significand to express the number in scientific notation gives:
+1.11110101100000000000000 x 2^7
Finally, expressing as a binary number and converting to decimal results in:
+11111010.1100000000000000 = +250.75
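The conversions worked by hand above can be reproduced in a few lines of Python. This is a minimal sketch, not a full IEEE implementation: the function names encode and decode are hypothetical, only nonzero normalized values are handled, and values that would need rounding are rejected rather than rounded. The significand is returned as one integer whose top bit is the explicit integer bit.

```python
import math

def encode(value: float, bias: int, precision: int) -> tuple:
    """Return (sign, biased exponent, significand) for a nonzero normal value."""
    sign = 1 if value < 0 else 0
    # frexp gives abs(value) = mantissa * 2**exp with 0.5 <= mantissa < 1
    mantissa, exp = math.frexp(abs(value))
    significand = math.ldexp(mantissa, precision)   # scale to 'precision' bits
    if significand != int(significand):
        raise ValueError("value needs rounding; not handled in this sketch")
    # exp - 1 is the exponent when the significand is written as 1.xxx
    return sign, exp - 1 + bias, int(significand)

def decode(sign: int, biased_exp: int, significand: int,
           bias: int, precision: int) -> float:
    """Inverse of encode: rebuild the value from its three fields."""
    magnitude = math.ldexp(significand, biased_exp - bias - (precision - 1))
    return -magnitude if sign else magnitude

sign, exp, sig = encode(125.375, bias=127, precision=24)
print(sign, f"{exp:X}H", f"{sig:024b}")   # 0 85H 111110101100000000000000
print(decode(0, 0x86, sig, bias=127, precision=24))   # 250.75
```

Calling encode with bias=1023 and precision=53 gives the double-precision biased exponent 405H found above.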
IEEE Real Number Memory Formats
Now that we have introduced the properties of IEEE real numbers and shown how to translate between decimal and floating-point numbers, let us continue by examining the requirements for storing real numbers in memory. Because a different exponent biasing range and significand precision are defined for single-, double-, and extended-precision real numbers, each of these real number types takes a different number of bits of storage in memory. Table 9.2 summarizes the memory requirements for each type of real number.

Table 9.2: IEEE Real-data Type Memory Formats

| IEEE memory formats | Single | Double | Double-extended | Quad |
| --- | --- | --- | --- | --- |
| Total memory format width (bits) | 32 | 64 | 80 | 128 |
| Sign field width (bits) | 1 | 1 | 1 | 1 |
| Exponent field width (bits) | 8 | 11 | 15 | 15 |
| Significand field width (bits) | 23 | 52 | 64 | 112 |

Earlier we found that the IEEE single-precision real numbers are characterized by a single sign bit, a biased exponent in the range from 01H (00000001 binary) through FEH (11111110 binary), and a significand with 24 bits of precision. The binary form of the biased exponent shows that the complete range can be coded with a maximum of eight bits.
Also, since the integer bit of the significand of a normalized number is always 1, it is assumed to be an implied 1 and not stored in memory. Therefore, the IEEE single, double, and quad memory formats allow one fewer bit than the precision for storage of the significand. (Table 9.2 shows that the double-extended format is the exception: its significand field holds all 64 bits.) In this way, we see that the total number of bits of memory required to hold a single-precision real number includes 1 bit for the sign, 8 bits for the biased exponent, and 23 bits for the significand, for a total of 32 bits. These requirements are consistent with those given under the single-precision column in Table 9.2. Since IEEE double-precision numbers are expressed with a larger exponent bias (+3FFH = 1111111111 binary) and more bits of precision (53 bits), more bits of memory are required to hold this type of number. Table 9.2 shows that the memory format has a total of 64 bits, which represents 1 bit for a sign, 11 for a biased exponent, and 52 for the significand.
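The 32-bit single-precision layout can be inspected directly with Python's standard struct module, which packs a value in IEEE single format. The sketch below is only an illustration of the field widths in Table 9.2; the helper name single_fields is hypothetical.

```python
import struct

def single_fields(value: float) -> tuple:
    """Return (32-bit word, sign, biased exponent, stored fraction)."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31               # 1 bit
    exponent = (word >> 23) & 0xFF  # 8 bits
    fraction = word & 0x7FFFFF      # 23 bits; the integer bit is not stored
    return word, sign, exponent, fraction

word, sign, exponent, fraction = single_fields(125.375)
print(f"{word:08X}H sign={sign} exponent={exponent:02X}H fraction={fraction:023b}")
# 42FAC000H sign=0 exponent=85H fraction=11110101100000000000000
```

The stored fraction matches the 23 bits to the right of the point in the significand found earlier for +125.375, with the leading 1 left implied.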
FLOATING-POINT APPLICATION PROGRAMMING MODEL
Chapter 3 introduced the application registers of the Itanium processor. At that time, we did not cover the registers that are provided to support the floating-point architecture. The registers that are used during floating-point operations are those in the floating-point register file, the floating-point status register, and control bits in the user mask. This section describes each of these register resources, their function, and instructions that affect their operation.
Floating-point Register File
The registers in the floating-point register file are used to hold source and destination operands for floating-point computations. Figure 9.1 shows the structure of the floating-point register file. Notice that it contains 128 independent floating-point registers, which are labeled Fr0 through Fr127. Each of these registers is 82 bits wide. The large floating-point register file reduces the number of load and store operations in applications and results in increased floating-point performance. Programs at all privilege levels can access data in the floating-point registers.

Figure 9.1: Floating-point Register File

Similar to the general register file, the floating-point register file is organized into two partitions. In this case, they are called the static registers and rotating registers. Registers Fr0 through Fr31 are used for the static part of the floating-point register file. Fr0 and Fr1 have fixed values, and when accessed as a source operand, their values are always read as +0.0 and +1.0, respectively. If either of these registers is used as a destination, a fault exception occurs as the attempt is made to write data into the register. The other 96 registers, Fr32 through Fr127, form the rotating floating-point register group. These registers can be used statically as source and destination operands, but have the added ability to rotate. Rotating floating-point source and destination registers are needed to implement software-pipelined loops that employ floating-point computations. For instance, the instructions br.ctop and br.cexit, which are employed in modulo-scheduled counted loops, both perform register renaming to rotate the floating-point registers. The rrb.fr bits of the frame marker register are used to rename the floating-point registers.
When we examined the general registers, we found that each register had a NaT bit associated with it to track the occurrence of a deferred exception during speculation. Figure 9.1 shows that the floating-point registers do not have a similar bit. A deferred exception can also result during a speculative load of a floating-point register. If this happens, the occurrence of the deferred exception is recorded by loading a special value into the destination floating-point register. This value is called the not a thing value (NaTVal).
Floating-point Status Register
Itanium architecture permits the user to define the floating-point computational model under software control. The computational model is indicated either in the encoding of the instruction or by the setting of control bits in the floating-point status register (FPSR), which is actually application register 40. The FPSR contains control bits that are used to specify the floating-point operating characteristics and status bits that contain information about the current floating-point operation. Using the control bits, the programmer can set the precision and range of the result, the rounding mode, and the response to traps. Figure 9.2 shows that the FPSR contains five fields: the trap field, a main status field (sf0), and three alternate status fields (sf1, sf2, and sf3). Use of the alternate status fields results in increased performance in some applications. For instance, the alternate status fields are required to support speculative loading of floating-point registers from memory. All four of the 13-bit floating-point status fields have the same format.
Figure 9.2: Floating-point Status Register Format

Let us begin by examining the function of the bits of the trap field. The six least significant bits of FPSR are the trap bits that enable and disable floating-point exception functions. Table 9.3 lists the names of each of these bits and describes their function. For example, bit zero is denoted traps.vd and stands for invalid operation floating-point exception. For an invalid operation detected during the execution of a floating-point instruction to initiate an exception, this trap must be enabled. Clearing traps.vd to 0 enables the exception function and setting it to 1 turns off the trap. The bits of the trap field can be individually set or cleared under software control.

Table 9.3: Trap Fields of the Floating-point Status Register

| Field | Bit | Description |
| --- | --- | --- |
| traps.vd | 0 | Invalid operation floating-point exception fault (IEEE trap) disabled when this bit is set |
| traps.dd | 1 | Denormal/unnormal operand floating-point exception fault disabled when this bit is set |
| traps.zd | 2 | Zero divide floating-point exception fault (IEEE trap) disabled when this bit is set |
| traps.od | 3 | Overflow floating-point exception trap (IEEE trap) disabled when this bit is set |
| traps.ud | 4 | Underflow floating-point exception trap (IEEE trap) disabled when this bit is set |
| traps.id | 5 | Inexact floating-point exception trap (IEEE trap) disabled when this bit is set |

Figure 9.3 shows that the status field contains five control fields and six status flags. The names of the status flags are listed in Table 9.4. Notice that one flag corresponds to each of the floating-point trap functions. As long as a trap is enabled, its corresponding status flag is updated dynamically during each floating-point computation. If an exception associated with a flag occurs as an instruction executes, the corresponding status bit is set to record this event. For instance, if the traps.od bit of the FPSR is clear when an overflow occurs as the result of processing a floating-point instruction, bit 10 (overflow flag) in the status field is set to 1. In this way, we see that the flags in the status field provide a record of the types of floating-point exceptions that have occurred. A floating-point error service routine can read these bits to identify what type of error has taken place and initiate the appropriate recovery process.
Figure 9.3: Floating-point Status Field Format

Table 9.4: Status Flags of the FPSR Status Field

| Field | Bit | Name |
| --- | --- | --- |
| v | 7 | Invalid operation (IEEE flag) |
| d | 8 | Denormal/unnormal operand |
| z | 9 | Zero divide (IEEE flag) |
| o | 10 | Overflow (IEEE flag) |
| u | 11 | Underflow (IEEE flag) |
| i | 12 | Inexact (IEEE flag) |

Earlier, we pointed out that the control information in the FPSR is used to enable or disable floating-point capabilities and to configure the characteristics of the floating-point computation model. Table 9.5 identifies each of the control bits and fields of the floating-point status field. These control fields can be individually set or reset through software.

Table 9.5: Control Fields of the FPSR Status Field

| Field | Bit | Name |
| --- | --- | --- |
| ftz | 0 | Flush-to-zero mode disable |
| wre | 1 | Widest range exponent |
| pc | 3:2 | Precision control |
| rc | 5:4 | Rounding control |
| td | 6 | Traps disable |

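The bit positions in Tables 9.4 and 9.5 can be turned into a small decoder for one 13-bit status field. The Python sketch below is illustrative only: the function name decode_status_field is hypothetical, and it operates on a status field that has already been extracted from the FPSR, since the position of each status field within the register is not listed in the tables.

```python
def decode_status_field(sf: int) -> dict:
    """Split a 13-bit status field into its control fields and status flags."""
    flag_names = "vdzoui"   # flags occupy bits 7 through 12 (Table 9.4)
    return {
        "ftz": sf & 1,           # bit 0
        "wre": (sf >> 1) & 1,    # bit 1
        "pc": (sf >> 2) & 0b11,  # bits 3:2
        "rc": (sf >> 4) & 0b11,  # bits 5:4
        "td": (sf >> 6) & 1,     # bit 6
        "flags": {name: (sf >> bit) & 1
                  for name, bit in zip(flag_names, range(7, 13))},
    }

# Example: overflow flag (bit 10) set, wre = 1, pc = 11
fields = decode_status_field((1 << 10) | 0b0001110)
print(fields["wre"], fields["pc"], fields["flags"]["o"])   # 1 3 1
```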
Notice that the 1-bit field labeled td stands for traps disable. This bit acts as a global enable/disable control for the trap functions. When this bit is reset to 0, the individual floating-point exception functions, such as overflow trap and underflow trap, are enabled and are turned on or off by their corresponding trap field. On the other hand, if td is set to 1, all of the floating-point trap functions are disabled independent of the settings of the individual trap fields.
The floating-point computational model is defined by the rounding control (rc) and the widest range exponent (wre) fields of the status field of the FPSR and by either appending a precision control (.pc) completer to the instruction or using the content of the precision control (pc) field of the status field. For instance, Table 9.6 shows how the rounding can be configured with the 2-bit rc field. Notice that loading rc with the value 11 selects the zero rounding mode. When in this mode, a result that is inexact is rounded by simply truncating extra bits.

Table 9.6: Floating-point Rounding Control Definitions

| Field | Value | Rounding mode |
| --- | --- | --- |
| rc | 00 | Nearest (or even) |
| rc | 01 | -Infinity (down) |
| rc | 10 | +Infinity (up) |
| rc | 11 | Zero (truncate/chop) |

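The effect of the four rounding modes is easiest to see on exact values that fall between two representable results. The following Python sketch is an analogy rather than a model of the hardware: it rounds an exact fraction to an integer, where the integer stands in for the last retained bit of the significand. The function name round_with_rc is hypothetical, and the mapping of rc values to directions assumes that 01 rounds toward negative infinity and 10 toward positive infinity.

```python
from fractions import Fraction
import math

def round_with_rc(x: Fraction, rc: int) -> int:
    """Round an exact value to an integer using the Table 9.6 rc encoding."""
    if rc == 0b00:
        return round(x)        # nearest; ties go to the even neighbor
    if rc == 0b01:
        return math.floor(x)   # toward -infinity (down)
    if rc == 0b10:
        return math.ceil(x)    # toward +infinity (up)
    if rc == 0b11:
        return math.trunc(x)   # toward zero (truncate/chop)
    raise ValueError("rc is a 2-bit field")

for value in (Fraction(7, 2), Fraction(-5, 2)):
    print(value, [round_with_rc(value, rc) for rc in range(4)])
# 7/2 [4, 3, 4, 3]
# -5/2 [-2, -3, -2, -2]
```

Note how truncation matches rounding down for the positive value but rounding up for the negative one.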
The exponent range and precision of the significand of the result of a floating-point computation are determined as indicated in Table 9.7. Notice that if the instruction is coded with a precision control completer, the value of pc in the status field is ignored. For now, we will focus on computational models that are not affected by the completer of the instruction. These entries have none listed under the completer column in Table 9.7, meaning that a .pc completer is not appended to the mnemonic of the floating-point instruction.

Table 9.7: Floating-point Computational Model Control Definitions
(The first three columns are the computational model control fields; the last three describe the computational model selected.)

| Instruction’s .pc completer | FPSR.sfx’s dynamic pc field | FPSR.sfx’s dynamic wre field | Significand precision | Exponent range | Computational style |
| --- | --- | --- | --- | --- | --- |
| .s | Ignored | 0 | 24 bits | 8 bits | IEEE real single |
| .d | Ignored | 0 | 53 bits | 11 bits | IEEE real double |
| .s | Ignored | 1 | 24 bits | 17 bits | Register file range, single precision |
| .d | Ignored | 1 | 53 bits | 17 bits | Register file range, double precision |
| None | 00 | 0 | 24 bits | 15 bits | IA-32 stack single |
| None | 01 | 0 | N/A | N/A | Reserved |
| None | 10 | 0 | 53 bits | 15 bits | IA-32 stack double |
| None | 11 | 0 | 64 bits | 15 bits | IA-32 double-extended |
| None | 00 | 1 | 24 bits | 17 bits | Register file range, single precision |
| None | 01 | 1 | N/A | N/A | Reserved |
| None | 10 | 1 | 53 bits | 17 bits | Register file range, double precision |
| None | 11 | 1 | 64 bits | 17 bits | Register file range, double-extended precision |
| Not applicable | Ignored | Ignored | 24 bits | 8 bits | A pair of IEEE real single |
| Not applicable | Ignored | Ignored | 64 bits | 17 bits | Register file range, double-extended precision |

As an example, let us consider the case of pc = 00 and wre = 1. Looking at the table, we see that this defines a floating-point number format with 24 bits of significand precision and an exponent range of up to 17 bits. This computation style is called register file range, single precision. If pc is changed to 10 and wre left at 1, the precision of the significand increases to 53 bits, but the exponent range remains 17 bits. This style of computation is known as register file range, double precision. An alternate way of specifying these same computation styles is to set the wre bit of the status field to 1 and append completer .s and .d, respectively, to the mnemonic of the floating-point instruction.
Example
From Table 9.7 determine how the floating-point operands of an instruction are specified to be in the IEEE real single-precision and double-precision styles.
Solution
Looking at the table, we find that a floating-point instruction expects its operands to be expressed as IEEE real single-precision or double-precision coded numbers under the following conditions:
The wre bit in the status field of the FPSR is 0.
The precision of the data is specified with the completer .s or .d, respectively, in the instruction.
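The selection rules of Table 9.7 amount to a lookup keyed by the completer, pc, and wre. The Python sketch below encodes the first twelve rows of the table for illustration; the names computational_model, WITH_COMPLETER, and WITHOUT_COMPLETER are hypothetical, and the two "not applicable" rows are left out.

```python
# Values are (significand bits, exponent bits, computational style).
WITH_COMPLETER = {            # keyed by (completer, wre); pc is ignored
    (".s", 0): (24, 8, "IEEE real single"),
    (".d", 0): (53, 11, "IEEE real double"),
    (".s", 1): (24, 17, "Register file range, single precision"),
    (".d", 1): (53, 17, "Register file range, double precision"),
}
WITHOUT_COMPLETER = {         # keyed by (pc, wre)
    (0b00, 0): (24, 15, "IA-32 stack single"),
    (0b10, 0): (53, 15, "IA-32 stack double"),
    (0b11, 0): (64, 15, "IA-32 double-extended"),
    (0b00, 1): (24, 17, "Register file range, single precision"),
    (0b10, 1): (53, 17, "Register file range, double precision"),
    (0b11, 1): (64, 17, "Register file range, double-extended precision"),
}

def computational_model(completer, pc: int, wre: int) -> tuple:
    """completer is '.s', '.d', or None when no .pc completer is coded."""
    if completer is not None:
        return WITH_COMPLETER[(completer, wre)]
    if pc == 0b01:
        raise ValueError("pc = 01 is reserved")
    return WITHOUT_COMPLETER[(pc, wre)]

print(computational_model(None, 0b10, 1))
# (53, 17, 'Register file range, double precision')
print(computational_model(".s", 0b11, 0))
# (24, 8, 'IEEE real single')
```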
Floating-point Status Register Status Field Instructions
Let us next look at the operation of the instructions that are used to check or modify the state of the fields in the floating-point status register. Table 9.8 identifies three instructions that affect the status fields of the floating-point status register. They are the floating-point clear flags (fclrf) instruction, floating-point set controls (fsetc) instruction, and floating-point check flags (fchkf) instruction.

Table 9.8: Floating-point Status Register Instruction Formats

| Mnemonic | Operation | Format |
| --- | --- | --- |
| fclrf | Floating-point clear flags | (qp) fclrf.sf |
| fsetc | Floating-point set controls | (qp) fsetc.sf amask7,omask7 |
| fchkf | Floating-point check flags | (qp) fchkf.sf target25 |

As its name implies, the fclrf instruction provides the ability to clear all six flag bits in the main or an alternate status field of the floating-point status register. Table 9.8 gives the format of this instruction. Notice that the instruction mnemonic must be appended with a completer that names one of the status fields. The notation used to specify each field is listed in Table 9.9. For example, the instruction needed to clear the flags in alternate status field 3 to 0 is written as:
fclrf.s3

Table 9.9: Status Field Completers

| sf | Status field accessed |
| --- | --- |
| .s0 (or none) | Main status field (sf0) |
| .s1 | Alternate status field 1 (sf1) |
| .s2 | Alternate status field 2 (sf2) |
| .s3 | Alternate status field 3 (sf3) |

The fsetc instruction permits the control bits of a status field to be individually set or reset. The format of the instruction in Table 9.8 shows that it has two 7-bit immediate operands called the immediate-AND mask (amask7) and the immediate-OR mask (omask7). An example is the instruction:
fsetc.s0 7FH,0EH
Assume that the current content of the control field is 0001010 binary (0AH). When the instruction is executed, the value of the control field in the main status field is read, then ANDed with the value of amask7 to give:
0001010 AND 1111111 = 0001010
Then, this result is ORed with the value of omask7 to give:
0001010 OR 0001110 = 0001110
Reloading this value into the main status field retains the wre field at 1 and changes the pc field from 10 to 11. The result is a change of computational style from register file range, double precision to register file range, double-extended precision.
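The AND-then-OR update is a one-line computation. The following Python sketch mirrors the example; the function name fsetc is borrowed from the mnemonic for readability, and the final 7-bit mask is an assumption added here so that the model never produces a value wider than the control field.

```python
def fsetc(controls: int, amask7: int, omask7: int) -> int:
    """Return the new 7-bit control field: (old AND amask7) OR omask7."""
    return ((controls & amask7) | omask7) & 0x7F

old = 0b0001010                        # wre = 1, pc = 10
new = fsetc(old, amask7=0x7F, omask7=0x0E)
wre = (new >> 1) & 1
pc = (new >> 2) & 0b11
print(f"{new:07b} wre={wre} pc={pc:02b}")   # 0001110 wre=1 pc=11
```

An AND mask with a 0 in some position forces that control bit to 0 unless the OR mask sets it again, which is how a single instruction can both clear and set selected bits.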
Earlier, we pointed out that the alternate status fields are needed to support speculation. When a speculative floating-point operation is performed, a temporary copy of the status flags is made in one of the alternate status fields. After a speculative execution operation is successfully completed, the main status field must be checked against the information held in the temporary copy in an alternate status field. This check operation is the function of the floating-point check flags instruction. The format of the fchkf instruction is given in Table 9.8. The immediate operand, target25, is the branch-to address of a service routine that corrects for an active exception in a speculative operation. The following is an example of the instruction:
fchkf.s1 fault_service
When this instruction is executed, the states of the flags in alternate status field 1 are compared to those in the main status field and the setting of the trap fields of the FPSR. If both sets of flags are the same, nothing needs to be done and they are correct. However, the instruction initiates a speculative operation fault when either of the following conditions applies:
The flags of the alternate status field identify an active floating-point exception whose function is enabled in the FPSR trap field.
An active exception is found that is not already marked by a status bit in the main status field.
This exception is serviced by transitioning to the fault-handler routine pointed to by the address displacement fault_service.
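The two fault conditions can be expressed as bit tests. The Python sketch below is a loose reading of the description above, not a definition of the instruction: the function name fchkf_faults is hypothetical, each argument is assumed to be a 6-bit value with one bit per exception in the order v, d, z, o, u, i (bit 0 through bit 5), and a trap bit of 1 is taken to mean "disabled", as in Table 9.3.

```python
def fchkf_faults(alt_flags: int, main_flags: int, traps_disabled: int) -> bool:
    """Return True when the check would branch to the service routine."""
    # Condition 1: a flagged exception whose trap is enabled (disable bit is 0)
    enabled_exception = alt_flags & ~traps_disabled & 0x3F
    # Condition 2: a flagged exception not yet recorded in the main status field
    unrecorded = alt_flags & ~main_flags & 0x3F
    return bool(enabled_exception or unrecorded)

INEXACT, OVERFLOW = 1 << 5, 1 << 3
print(fchkf_faults(INEXACT, INEXACT, 0x3F))    # False: already recorded, traps off
print(fchkf_faults(OVERFLOW, 0, 0x3F))         # True: not yet in the main flags
print(fchkf_faults(OVERFLOW, OVERFLOW, 0x37))  # True: overflow trap is enabled
```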
Floating-point Fields of the User Mask
Earlier, we pointed out that bits in the user mask (UM) part of the processor status register are affected by a floating-point operation. In the UM of Figure 9.4, these bits are identified as mfl and mfh. Table 9.10 shows that they indicate whether the result of a floating-point operation was written to a lower floating-point register (static register) or an upper floating-point register (rotating register), respectively. For instance, logic 1 in mfl means that the operation performed by a floating-point instruction, which uses a register in the range Fr2 through Fr31 as the destination of the result, has completed.

Figure 9.4: User Mask Format

Table 9.10: User Mask Floating-point Field Description

| Field | Bit | Description |
| --- | --- | --- |
| mfl | 4 | Lower (Fr2 .. Fr31) floating-point register written |
| mfh | 5 | Upper (Fr32 .. Fr127) floating-point register written |

FLOATING-POINT REGISTER FORMAT AND DATA TYPES ENCODING
In the last section, we introduced the registers that are involved in floating-point operation, and noted that the source and destination operands of floating-point computations are held in the registers of the floating-point register file. This data can be either integer or real-type numbers. This section continues the study of the Itanium processor’s floating-point architecture by examining the organization of data held in the floating-point registers.

Floating-point Register Format
Earlier, we pointed out that the Itanium architecture supports an 82-bit floating-point register format. This format is designed to accommodate both integer and real-type numbers. Remember that the three elements of a real-type floating-point number are the sign bit, biased exponent, and significand. Figure 9.5 shows how the floating-point registers are partitioned to accommodate these fields. Notice that the 64 least significant bits, b0 through b63, hold the significand. The next 17 bits, b64 through b80, contain the biased exponent, and the most-significant bit, b81, is the sign bit. The significand field is composed of an explicit integer bit (b63) and 63 bits for the fraction part of the significand (b0 through b62).
Figure 9.5: Format of a Floating-point Register

This floating-point register format has enough bits to represent a variety of different types of floating-point values. Table 9.11 shows the types of values, whether or not they are supported, and how they are encoded. The different value types are identified as classes and subclasses. For example, the normalized number class includes the subclasses of IEEE single-precision, double-precision, and extended double-precision numbers. The classes of value encoding that are shaded in Table 9.11 are unsupported.
Table 9.11: Floating-point Register Encoding
(The significand is written i.bb...bb, with the explicit integer bit shown; it is 64 bits wide.)

| Class or subclass | Sign (1 bit) | Biased exponent (17 bits) | Significand (64 bits) |
| --- | --- | --- | --- |
| NaNs | 0/1 | 0x1FFFF | 1.000...01 through 1.111...11 |
| Quiet NaN | 0/1 | 0x1FFFF | 1.100...00 through 1.111...11 |
| Quiet NaN indefinite | 1 | 0x1FFFF | 1.100...00 |
| Signaling NaNs | 0/1 | 0x1FFFF | 1.000...01 through 1.011...11 |
| Infinity | 0/1 | 0x1FFFF | 1.000...00 |
| Pseudo-NaNs | 0/1 | 0x1FFFF | 0.000...01 through 0.111...11 |
| Pseudo-infinity | 0/1 | 0x1FFFF | 0.000...00 |
| Normalized numbers (floating-point register format normals) | 0/1 | 0x00001 through 0x1FFFE | 1.000...00 through 1.111...11 |
| Integers or parallel FP (large unsigned or negative signed integers) | 0 | 0x1003E | 1.000...00 through 1.111...11 |
| Integer indefinite | 0 | 0x1003E | 1.000...00 |
| IEEE single-real normals | 0/1 | 0x0FF81 through 0x1007E | 1.000...00...(40)0s through 1.111...11...(40)0s |
| IEEE double-real normals | 0/1 | 0x0FC01 through 0x103FE | 1.000...00...(11)0s through 1.111...11...(11)0s |
| IEEE double-extended real normals | 0/1 | 0x0C001 through 0x13FFE | 1.000...00 through 1.111...11 |
| Normal numbers with the same value as double-extended real pseudo-denormals | 0/1 | 0x0C001 | 1.000...00 through 1.111...11 |
| IA-32 stack single-real normals (produced when computational model is IA-32 stack single) | 0/1 | 0x0C001 through 0x13FFE | 1.000...00...(40)0s through 1.111...11...(40)0s |
| IA-32 stack double-real normals (produced when computational model is IA-32 stack double) | 0/1 | 0x0C001 through 0x13FFE | 1.000...00...(11)0s through 1.111...11...(11)0s |
| Unnormalized numbers (floating-point register format unnormalized numbers) | 0/1 | 0x00000 | 0.000...01 through 1.111...11 |
| | 0/1 | 0x00001 through 0x1FFFE | 0.000...01 through 1.111...11 |
| | 0/1 | 0x00001 through 0x1FFFD | 0.000...00 |
| | 1 | 0x1FFFE | 0.000...00 |
| Integers or parallel FP (positive signed/unsigned integers) | 0 | 0x1003E | 0.000...00 through 0.111...11 |
| IEEE single-real denormals | 0/1 | 0x0FF81 | 0.000...01...(40)0s through 0.111...11...(40)0s |
| IEEE double-real denormals | 0/1 | 0x0FC01 | 0.000...01...(11)0s through 0.111...11...(11)0s |
| Register format denormals | 0/1 | 0x00001 | 0.000...01 through 0.111...11 |
| Unnormal numbers with the same value as IEEE double-extended real denormals | 0/1 | 0x0C001 | 0.000...01 through 0.111...11 |
| IEEE double-extended real denormals | 0/1 | 0x00000 | 0.000...01 through 0.111...11 |
| IA-32 stack single-real denormals (produced when computational model is IA-32 stack single) | 0/1 | 0x00000 | 0.000...01...(40)0s through 0.111...11...(40)0s |
| IA-32 stack double-real denormals (produced when computational model is IA-32 stack double) | 0/1 | 0x00000 | 0.000...01...(11)0s through 0.111...11...(11)0s |
| Double-extended real pseudo-denormals (IA-32 stack and memory format) | 0/1 | 0x00000 | 1.000...00 through 1.111...11 |
| Pseudo-zeros | 0/1 | 0x00001 through 0x1FFFD | 0.000...00 |
| | 1 | 0x1FFFE | 0.000...00 |
| NaTVal | 0 | 0x1FFFE | 0.000...00 |
| Zero | 0/1 | 0x00000 | 0.000...00 |
| Fr0 (positive zero) | 0 | 0x00000 | 0.000...00 |
| Fr1 (positive one) | 0 | 0x0FFFF | 1.000...00 |

IEEE Real Numbers
Table 9.11 shows how the IEEE real numbers and other special purpose values and data types are encoded in the Itanium processor’s floating-point register format. This table lists three pieces of information for each class of data: permitted sign bit values, range of allowed biased exponent values, and range of possible significand values. The IEEE real numbers are those identified in the table as IEEE single-real normals, IEEE double-real normals, and IEEE double-extended real normals. Let us look more closely at the information provided in Table 9.11 for the single-precision real numbers. As expected, the allowed values of the sign bit are 0 for positive and 1 for negative. Moreover, the significand is expressed with 1 integer bit and 23 fraction bits. Note that the integer bit of the significand is explicit in the register, not implied. For this reason, the allowed significand range is expressed as:
1.00000000000000000000000 ≤ Significand ≤ 1.11111111111111111111111
The notation (40)0s that follows the upper and lower bound in Table 9.11 means that the unused lower 40 bits of the register are filled with zeros. Finally, the range of the biased exponent is expressed as:
0FF81H ≤ Biased exponent ≤ 1007EH
This biased exponent range differs from that found for the IEEE real single-precision computational style in “IEEE Real Numbers,” because this is a different IEEE real single-precision number computational style. This style, which is called register file range, single precision, supports the full 17-bit exponent of the register format. Therefore, the exponent is biased by adding +65535 (FFFFH), instead of +127 (7FH).
Example
What are the ranges of the biased exponent and significand for IEEE double-real normals? What value is added to form the biased exponent?
Solution
Table 9.11 shows that the significand range is 11 bits less than the 64 bits available in the register format. Therefore, the double-precision significand is 53 bits in length. The biased exponent range is from 0FC01H through 103FEH. So, +65535 (FFFFH) is again used as the bias.
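The exponent ranges quoted from Table 9.11 follow from removing the memory-format bias and applying the register-format bias of FFFFH. The Python sketch below checks this for the three IEEE types; the name register_exponent is hypothetical, and the input ranges are the biased exponent limits derived from Table 9.1.

```python
REGISTER_BIAS = 0xFFFF   # +65535, the bias of the 17-bit register exponent

def register_exponent(memory_exponent: int, memory_bias: int) -> int:
    """Re-bias a memory-format biased exponent for the register format."""
    return memory_exponent - memory_bias + REGISTER_BIAS

# (type, memory bias, minimum and maximum biased exponent of a normal number)
for name, bias, low, high in (("single", 127, 0x01, 0xFE),
                              ("double", 1023, 0x001, 0x7FE),
                              ("double-extended", 16383, 0x0001, 0x7FFE)):
    print(f"{name:16s} {register_exponent(low, bias):05X}H through "
          f"{register_exponent(high, bias):05X}H")
# single           0FF81H through 1007EH
# double           0FC01H through 103FEH
# double-extended  0C001H through 13FFEH
```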
Special Numbers and Other Data Types
Some special numbers and other data types are encoded by using specific exponents or significands outside the range of the real normal data types. Let us first look at a few special numbers. Two examples are +∞ and -∞. They are encoded with sign bit 0 for + or 1 for -, all 1s in the biased exponent (1FFFFH), 1 as the integer bit of the significand, and all 0s in the fractional part of the significand. For instance, +∞ is coded and represented in floating-point registers as 1FFFF8000000000000000H as follows:
Sign bit = 0 = +
Biased exponent = 11111111111111111 = 1FFFFH
Significand = 1.000000000000000000000000000000000000000000000000000000000000000 = 8000000000000000H
Other examples are the coding of IEEE signed zeros, +0 and -0. These two values are encoded with the appropriate sign bit and all 0s in both the biased exponent and significand fields. In this way, we see that +0 and -0 are coded in a floating-point register as 000000000000000000000H and 200000000000000000000H, respectively.
Let us next look at how some other data types are encoded. Earlier, we pointed out that the floating-point registers also can hold integer numbers. Notice in Table 9.11 that integer numbers are always coded with 0 in the sign field and with the fixed biased exponent value 1003EH. The 64-bit value of the integer is placed in the significand field. The integer can be a signed or unsigned number. If it represents a signed number, the sign is in bit 63, not bit 81.
An example of a special data type is the NaTVal that is used to mark the occurrence of a deferred exception that occurred during a speculative floating-point load instruction sequence. Table 9.11 shows that NaTVal is coded into a register as the value 1FFFE0000000000000000H. Another example of a special data type is the not a number (NaN) class of numbers. A NaN represents an invalid floating-point result and is identified by all 1s in the biased exponent field, an integer bit of 1, and a nonzero fraction in the significand.
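The 82-bit encodings quoted above can be reproduced by placing the three fields at the bit positions given for the register format (sign at b81, biased exponent at b64 through b80, significand at b0 through b63). The Python sketch below is illustrative; the helper names pack_fr, pack_integer, and as_hex are hypothetical.

```python
MASK64 = (1 << 64) - 1

def pack_fr(sign: int, biased_exponent: int, significand: int) -> int:
    """Assemble an 82-bit register image from its three fields."""
    assert sign in (0, 1)
    assert 0 <= biased_exponent < (1 << 17)
    assert 0 <= significand <= MASK64
    return (sign << 81) | (biased_exponent << 64) | significand

def pack_integer(n: int) -> int:
    """Integers use sign 0 and exponent 1003EH; a signed value keeps its sign in bit 63."""
    return pack_fr(0, 0x1003E, n & MASK64)

def as_hex(image: int) -> str:
    return f"{image:021X}H"   # 82 bits print as 21 hexadecimal digits

print(as_hex(pack_fr(0, 0x1FFFF, 1 << 63)))  # +infinity: 1FFFF8000000000000000H
print(as_hex(pack_fr(1, 0x00000, 0)))        # -0:        200000000000000000000H
print(as_hex(pack_fr(0, 0x1FFFE, 0)))        # NaTVal:    1FFFE0000000000000000H
print(as_hex(pack_integer(-5)))              # integer:   1003EFFFFFFFFFFFFFFFBH
```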
Parallel Floating-point Data
The instruction set has parallel floating-point instructions and a special parallel floating-point data type. This class of data, which is identified as parallel FP in Table 9.11, holds a pair of 32-bit IEEE single real numbers in the significand field. Just like for integer numbers, parallel floating-point data is encoded with 0 in the sign bit and 1003EH in the biased exponent. The distinction between integer and parallel floating-point numbers is the context in which they are used. The two 32-bit IEEE single-precision numbers are located in bits 0 through 31 and bits 32 through 63, respectively.
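As a rough illustration of the parallel format, the following sketch packs two single-precision values into one register image. The function names and the choice of which argument goes in the upper half are assumptions made for the example, not something the text specifies.

```python
import struct


def single_bits(x):
    """Return the 32-bit IEEE single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack_parallel(high, low):
    """Build a parallel FP image: sign 0, exponent 1003EH, two singles."""
    significand = (single_bits(high) << 32) | single_bits(low)
    return (0x1003E << 64) | significand


def unpack_parallel(image):
    """Recover the two single-precision values from bits 63-32 and 31-0."""
    significand = image & ((1 << 64) - 1)
    high = struct.unpack("<f", struct.pack("<I", significand >> 32))[0]
    low = struct.unpack("<f", struct.pack("<I", significand & 0xFFFFFFFF))[0]
    return high, low


image = pack_parallel(1.5, -2.0)
print(f"{image:021X}H")
print(unpack_parallel(image))
```

The same 82-bit pattern could equally be read as a 64-bit integer, which is the point the text makes about context.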
FLOATING-POINT MEMORY ACCESS, REGISTER TRANSFER, AND DATA CONVERSION INSTRUCTIONS
Let us begin a study of the floating-point instructions with the instructions that enable floating-point data between the registers of the floating-point register file and memory, between the floating-point and gen and converted between integer and floating-point data types. Chapter 5 examined the use of the load i data from memory to a general register. At that time we found that the load and store instructions could transfer information between memory and the floating-point registers. This is the role of the floating-poi instructions. It is also necessary to transfer floating-point information between the general registers and registers. This operation is performed by the register transfer instructions. Finally, the data conversion conversion of a floating-point number in one floating-point register to either a signed or unsigned intege floating-point register. In this section, we examine the operation of these three types of floating-point in
Floating-point Load, Store, and Load-pair Instructions
Let us begin a study of the floating-point instruction set with the floating-point memory access instructio the load, load-pair, and store instructions . Load instructions transfer data from a storage location in me point register or a pair of floating-point registers. The store instruction reverses this operation. It saves point register to a storage location in memory. There are no floating-point store-pair instructions.
The same performance concerns addressed for the integer load operations apply to the loading of floa Examples are minimizing the impact of memory access latency and bus bandwidth. Just like the intege instruction, the floating-point load instruction addresses the need for high performance with speculative latency and load hints that improve caching and bus utilization. However, a new form of the floating-po is also provided that has the ability to load two independent elements of floating-point data into separa
registers. This load pair instruction further improves the use of the bus bandwidth.
Table 9.12 lists the formats of the floating-point load (ldf) and store (stf) instructions. Looking at the that they are essentially the same as those described for the integer load and store instructions in Chap instance, the floating-point load type (fldtype) completers are the same, but with the exceptions that and biased load types are not supported. Moreover, the load hint (ldhint) completers are identical to for the integer instructions.

Table 9.12: Floating-point Memory Access, Register Transfer, and Data Conversion Instruction Formats
Mnemonic  Operation
    Format
ldf  Floating-point load
    (qp) ldffsz.fldtype.ldhint f1=[r3]
    (qp) ldffsz.fldtype.ldhint f1=[r3],r2
    (qp) ldffsz.fldtype.ldhint f1=[r3],imm9
    (qp) ldf8.fldtype.ldhint f1=[r3]
    (qp) ldf8.fldtype.ldhint f1=[r3],r2
    (qp) ldf8.fldtype.ldhint f1=[r3],imm9
    (qp) ldf.fill.ldhint f1=[r3]
    (qp) ldf.fill.ldhint f1=[r3],r2
    (qp) ldf.fill.ldhint f1=[r3],imm9
stf  Floating-point store
    (qp) stffsz.sthint [r3]=f2
    (qp) stffsz.sthint [r3]=f2,imm9
    (qp) stf8.sthint [r3]=f2
    (qp) stf8.sthint [r3]=f2,imm9
    (qp) stf.spill.sthint [r3]=f2
    (qp) stf.spill.sthint [r3]=f2,imm9
ldfps  Floating-point load-pair
    (qp) ldfps.fldtype.ldhint f1,f2=[r3]
    (qp) ldfps.fldtype.ldhint f1,f2=[r3],8
    (qp) ldfpd.fldtype.ldhint f1,f2=[r3]
    (qp) ldfpd.fldtype.ldhint f1,f2=[r3],16
    (qp) ldfp8.fldtype.ldhint f1,f2=[r3]
    (qp) ldfp8.fldtype.ldhint f1,f2=[r3],16
mov  Move floating-point register
    (qp) mov f1=f3
getf  Get floating-point value, exponent, or significand
    (qp) getf.s r1=f2
    (qp) getf.d r1=f2
    (qp) getf.exp r1=f2
    (qp) getf.sig r1=f2
setf  Set floating-point value, exponent, or significand
    (qp) setf.s f1=r2
    (qp) setf.d f1=r2
    (qp) setf.exp f1=r2
    (qp) setf.sig f1=r2
fcvt  Convert floating-point to integer
    (qp) fcvt.fx.sf f1=f2
    (qp) fcvt.fx.trunc.sf f1=f2
    (qp) fcvt.fxu.sf f1=f2
    (qp) fcvt.fxu.trunc.sf f1=f2
fcvt  Convert signed/unsigned integer to floating-point
    (qp) fcvt.xf f1=f2
    (qp) fcvt.xuf.pc.sf f1=f3
fpcvt  Convert parallel floating-point to integer
    (qp) fpcvt.fx.sf f1=f2
    (qp) fpcvt.fx.trunc.sf f1=f2
    (qp) fpcvt.fxu.sf f1=f2
    (qp) fpcvt.fxu.trunc.sf f1=f2
The allowed forms of the floating-point load and store instruction are given in Table 9.13. Finally, the a for floating-point load and store instructions are the same as for integer load and store instructions.

Table 9.13: Supported Floating-point Load and Store Instructions
Mnemonic (Normal)        Mnemonic (Load-pair)       Operation
ldf                      ldfp                       Load
ldf.s                    ldfp.s                     Speculative load
ldf.a                    ldfp.a                     Advanced load
ldf.sa                   ldfp.sa                    Speculative advanced load
ldf.c.nc, ldf.c.clr      ldfp.c.nc, ldfp.c.clr      Check load
ldf.fill                                            Fill
stf                                                 Store
stf.spill                                           Spill
Let us just briefly look at the differences between the integer and floating-point load and store operation instructions process floating-point data, the values of the size mnemonic extension fsz are changed. T process single-precision, double-precision, and extended double-precision floating-point numbers. As s, appending s, d, or e to the mnemonic specifies these sizes, respectively. Also, remember that durin the mfl and mfh bits in the user mask of the processor status register are updated to reflect whether th static or to a rotating floating-point register.

Table 9.14: fsz Completers
fsz    Bytes accessed    Memory format
s      4 bytes           Single precision
d      8 bytes           Double precision
e      10 bytes          Extended precision
When held in the register file, single-, double- and extended double-precision data all use the same 82 However, Table 9.14 shows that they are formatted differently when placed in memory. Notice that eac single-precision data is compressed to fit into 4 bytes of memory. In fact, data is placed in memory usin memory format summarized in Table 9.2 . For this reason, the format of floating-point data must be tra the transfer process between the floating-point register file and memory. This format conversion is auto performed by the load and store instructions.
Figure 9.6 (a) shows how the bits of a single-precision floating-point number expressed in 82-bit registe compressed to 4 bytes for storage in memory. First, the 23 bits of the fractional significand are stored in addressed bytes of memory along with the least significant bit of a compressed form of the biased expo the compressed exponent and the sign bit are placed in the fourth byte, which is the highest addressed Compression of this information in this way results in more efficient use of memory.
Figure 9.6: Saving and Loading a Single-precision Floating-point Number from Memory
This memory format raises a question about how the 17 bits of the biased exponent are compacted int Remember that the range of biased exponents for a single-precision floating-point number is from 0FF 1007EH. Notice that only the seven least significant bits, 0 through 6, of the exponent change. Bits 7 th all 1 or 0 and bit 17 is always the complement of this value. In this way, we find that the actual values o must be saved, but the value in bits 7 through 17 can be saved as a single bit equal in value to the mos the exponent. This value is appended as the seventh bit of the compressed exponent. Looking at Figur that when a save is performed, the compressed biased exponent is ANDed with the integer bit, then th memory. Since the integer bit is 1 for non-zero-normalized numbers, infinities, and NaNs, the actual va compressed exponent is saved in memory for these floating-point data types. The integer bit is 0 for flo and denormal numbers. For both of these cases, the AND operation forces the value of the compresse
When the load of a non-zero-normalized number is performed, the operation illustrated in Figure 9.6 (b Notice that the integer bit in the register is set to 1 and the part that holds the fractional significand is re memory. As the fractional significand loads, unused lower-order bits of the register are filled with 0s. No bits of the exponent are directly restored with their corresponding value from memory. Finally, bit 17 of register is loaded with the value of the most significant bit of the compressed exponent read from mem through 16 are loaded with its complement.
As shown in Figures 9.7 and 9.8, the double-precision and extended double-precision number translati in the same way, but take up 8 and 10 bytes in memory, respectively.
Figure 9.7: Saving and Loading a Double-precision Floating-point Number from Memory
Figure 9.8: Saving and Loading a Double-extended Precision Floating-point Number from Memory
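The exponent compression described for single precision can be sketched as follows. This is a minimal model under stated assumptions: the 17-bit biased exponent is numbered bit 0 through bit 16, and the expansion function covers only nonzero normalized numbers, as in the load case the text walks through. The function names are hypothetical.

```python
def compress_exponent(reg_exp, integer_bit):
    """Reduce a 17-bit register exponent to the 8-bit single-precision form.

    The top exponent bit and the seven low bits are kept; the result is
    ANDed with the integer bit, so zeros and denormals store 0.
    """
    msb = (reg_exp >> 16) & 1
    low7 = reg_exp & 0x7F
    compressed = (msb << 7) | low7
    return compressed if integer_bit else 0


def expand_exponent(mem_exp):
    """Rebuild the 17-bit exponent of a nonzero normalized single."""
    msb = (mem_exp >> 7) & 1
    middle = 0 if msb else 0x1FF  # bits 7-15 hold the complement of msb
    return (msb << 16) | (middle << 7) | (mem_exp & 0x7F)


for reg_exp in (0x0FF81, 0x0FFFF, 0x10000, 0x10006, 0x1007E):
    mem = compress_exponent(reg_exp, 1)
    print(f"{reg_exp:05X}H -> {mem:02X}H -> {expand_exponent(mem):05X}H")
```

The compressed value equals the usual IEEE single-precision biased exponent, that is, the register exponent minus FFFFH plus 127.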
The floating-point spill and fill instructions (stf.spill , ldf.fill ) are an exception to this data form process. When executed, they save or restore, respectively, floating-point information using the 82-bit register format. For this reason, a saved element of data takes up 16 bytes of memory. The organizatio information is spilled to memory is shown in Figure 9.9 .
Figure 9.9: Spilling the Content of a Floating-point Register to Memory
The integer forms of load and store (ldf8 , stf8 ) are also exceptions. When an stf8 operation is pe bit and fractional significand are saved in 8 bytes of memory, as shown in Figure 9.10 (a). If an ldf8 in
executed to restore the contents of this memory location to a register, the byte of information is loaded significand field of the destination registers without conversion. Figure 9.10 (b) shows that the exponen destination floating-point register is set to 1003EH, which is the biased exponent value for all integers, made 0.
Figure 9.10: Saving and Loading an Integer Number in Floating-point Memory Format into a Floatin Example
If the value in floating-point register Fr10 is +250.75, express the number in register format, then write an instruction that would save this value in memory at the address in register Gr5. Assume that the address will not be autoincremented. What is the value placed in memory?
Solution
Expressing the number in Fr10 as a binary number, then expressing it in scientific notation with 32 bits o
+1.11110101100000000000000 × 2^7
Therefore, the components are as follows:
Sign bit = 0
Integer bit = 1
Significand = 11110101100000000000000
The biased exponent is found to be:
Exponent = 7H + FFFFH = 10006H = 10000000000000110
Finally, expressing the number in register format gives:
+250.75 = 01000000000000011011111010110000000000000000000000000000000000000000000000000
= 10006FAC0000000000000H
Notice that the explicit integer bit is represented in register format. Because the memory address must be held in a general register, the instruction needed to store this value in memory is:
stfs [r5]=f10
When the value is saved in memory, it is compacted into the 4-byte memory format. Remember that th following information: Sign bit = 0 Exponent (most-significant bit) = 1 Exponent (least-significant 7 bits) = 0000110 Significand = 11110101100000000000000 Notice that the explicit integer bit is not included in the number. The results are: +250.75 = 01000011011110101100000000000000 = 437AC000H
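The two encodings in this example can be reproduced with a short sketch. It assumes a finite, nonzero, normalized input and relies on Python's struct module for the 4-byte IEEE single-precision image; the function names are hypothetical.

```python
import math
import struct


def to_register(x):
    """Return the 82-bit register image of a finite, nonzero, normal float."""
    sign = 1 if x < 0 else 0
    mantissa, exp2 = math.frexp(abs(x))    # abs(x) = mantissa * 2**exp2, 0.5 <= mantissa < 1
    significand = int(mantissa * 2 ** 64)  # explicit integer bit lands in bit 63
    biased_exp = (exp2 - 1) + 0xFFFF       # register bias is FFFFH
    return (sign << 81) | (biased_exp << 64) | significand


def to_single_memory(x):
    """Return the 32-bit single-precision memory image of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


value = 250.75
print(f"register: {to_register(value):021X}H")
print(f"memory:   {to_single_memory(value):08X}H")
```

The memory image drops the explicit integer bit and uses the 8-bit exponent, so it is much shorter than the register image of the same number.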
Two other data types, single-precision pair (8 bytes) and double-precision pair (16 bytes), are supported by the single-precision floating-point load-pair (ldfps) and double-precision floating-point load-pair (ldfpd) instructions. When executed, these instructions load two single- or double-precision numbers, respectively, from memory into two independent floating-point registers. The formats of the load-pair instructions are given in Table 9.12. Restrictions are placed on the selection of the two destination registers. For example, if both registers are either static floating-point registers or rotating floating-point registers, then one register must be odd numbered and the other even numbered. Also, the autoincrement form of addressing is limited to a fixed immediate value of 8 for the single-precision instruction and 16 for the double-precision form. As the formats in Table 9.12 show, the address is taken from a general register. An example is the instruction
(p0) ldfpd f9,f10=[r5],10H
When this instruction is executed, the 16 bytes of data are read from memory starting at the address specified by the value in Gr5. This 128-bit value is treated as two contiguous double-precision floating-point numbers. Each number is converted to floating-point register format and then placed into one of the destination floating-point registers. The lower addressed value from memory loads into Fr9 and the higher addressed value into Fr10. At completion of the memory access, the address in Gr5 is incremented by 16 (10H). In this way, it automatically points to the next data pair in memory. The modify floating-point register bits in the user mask are also updated by parallel load operations.
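The effect of a double-precision load-pair with autoincrement can be modeled in a few lines. This is only a sketch: memory is a Python bytes object, little-endian byte order is assumed, and the function returns plain floats rather than 82-bit register images.

```python
import struct


def ldfpd(memory, address):
    """Model a double-precision load-pair with a post-increment of 16.

    Returns (lower addressed value, higher addressed value, new address).
    """
    low, high = struct.unpack_from("<2d", memory, address)
    return low, high, address + 16


# Four doubles laid out contiguously, that is, two pairs.
memory = struct.pack("<4d", 1.5, -2.25, 3.0, 4.0)

first, second, next_addr = ldfpd(memory, 0)
print(first, second, next_addr)

third, fourth, next_addr = ldfpd(memory, next_addr)
print(third, fourth, next_addr)
```

Calling the function twice in a row shows why the fixed increment is convenient: the updated address already points at the next pair.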
Floating-point Register Transfer Instructions
Now that we know the method and format for transferring floating-point information between the floating and memory, let us turn our attention to the transfer of data between the registers in the floating-point f general and floating-point register files. This data transfer operation is the role of the following floating-p transfer instructions: move floating-point register (mov ), get floating-point value, exponent, or significan floating-point value, exponent, or significand (setf ). The formats of these instructions are listed in Tab
The move instruction is used to copy floating-point information from one register to another in the floati file. An example is the instruction: (p0) mov f5=f10
When executed, the value in source operand Fr10 is copied into destination Fr5 . Another example is th (p0) mov f32=f1 Execution of this instruction initializes Fr32 with the value: +1.0 = 0FFFF8000000000000000H
The instructions that are used to transfer a complete floating-point number between the general and flo files are denoted as getf.s , getf.d and setf.s , setf.d. Here the .s completer tells that a sin formatted number is to be transferred and .d stands for a double-precision number. An example is the instruction. (p0) getf.s r10=f10
When this instruction is executed, the register-formatted single-precision number held in floating-point r translated into single-precision memory format (see Figure 9.6 (a)), then placed into Gr10 . Since single format requires only the four least-significant bytes of Gr10 , the 32 unused more-significant bits are fille This operation can be reversed by the instruction: (p0) setf.s f10=r10
When this instruction executes, it takes the memory-formatted floating-point value held in Gr10 , re-exp in 82-bit floating-point register format as shown in Figure 9.6 (b), and places the translated value in Fr1 Example
The floating-point number +125.375 is expressed in Fr10 as the value 10005FAC0000000000000H. Wh produced in Gr5 by executing the following instruction? (p0) getf.d r5=f10
Solution Expressing the number in Fr10 in binary form gives:
010000000000000101111110101100000000000000000000000000000000000000000000000000 Extracting the sign bit, integer bit, and 63-bit fractional significand, we get: Sign bit = 0 Integer bit = 1 Signif = 111101011000000000000000000000000000000000000000000000000000000 The 17-bit exponent is: Biased exponent = 10005H = 10000000000000101 Executing the getf.d instruction converts the double-precision number to 8-byte memory format and Gr5 . First, we must extract the following values from the register-formatted data. Sign bit = 0 Exponent (most-significant bit) = 1 Exponent (least-significant 10 bits) = 0000000101 Signif = 1111010110000000000000000000000000000000000000000000 Using this information and the translation process outlined in Figure 9.7 (b) gives: Gr5 = 010000000101111101011000000000000000000000000000 0000000000000000 = 405F580000000000H
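The translation in this example can be checked with a small sketch. It assumes a normalized double-precision value whose low 11 significand bits are already zero, and it follows the same exponent compression idea as the single-precision case, here with ten low exponent bits. The function name is hypothetical.

```python
import struct


def getf_d(image):
    """Convert an 82-bit register image to the 8-byte double memory format."""
    sign = (image >> 81) & 1
    reg_exp = (image >> 64) & 0x1FFFF
    significand = image & ((1 << 64) - 1)
    msb = (reg_exp >> 16) & 1
    mem_exp = (msb << 10) | (reg_exp & 0x3FF)          # 11-bit exponent
    fraction = (significand >> 11) & ((1 << 52) - 1)   # drop the integer bit
    return (sign << 63) | (mem_exp << 52) | fraction


fr10 = int("10005" + "FAC" + "0" * 13, 16)  # +125.375 in register format
gr5 = getf_d(fr10)
print(f"{gr5:016X}H")
print(struct.unpack(">d", gr5.to_bytes(8, "big"))[0])
```

Reinterpreting the 64-bit result as an IEEE double recovers the original decimal value, which confirms the bit fields line up.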
A second type of register transfer operation is defined to transfer only the value of the sign and expone of an integer number between a floating-point register and a general register. Table 9.12 shows that ge getf.sig are used to perform these operations for a floating-point to general register transfer. An exa instruction getf.exp r5=f10
Executing this instruction copies the complete 17-bit biased exponent of the integer value in Fr10 to the bits of Gr5 (bits 0 through 16) and the sign bit of the value into bit 17. Again, unused more-significant bi This operation is reversed by executing the instruction: setf.exp f10=r5
When this instruction is executed, the most significant bit of the significand in destination register Fr10 i rest of the significand bits are cleared to 0.
The setf.sig instruction converts the unsigned integer number in a source general register to its equ integer number in a destination floating-point register. Figure 9.10 shows the format of an integer in a f register. The sign bit is always 0, the exponent is fixed as 1003EH, and the lower 64 bits hold the value When executed, the complete 64-bit unsigned integer is copied to the significand part of the specified d register. The getf.sig instruction performs the reverse operation. That is, it is used to extract the inte significand of an integer in a floating-point register and place this value in the destination general regist Example
Assume that the value in Fr10 is +125, equal to 1003E000000000000007DH. What result is produced in Gr5 by executing the following instruction?
(p0) getf.sig r5=f10
Solution
Execution of this instruction extracts the complete 64-bit significand from the value in Fr10 and places it register Gr5 . Expressing the value in Fr10 in binary form gives:
01000000000011111000000000000000000000000000000000000000000000000000000000011 The destination register contents are: Gr5 = 00000000000000000000000000000000000000000000000000000000001111101 = 000000000000007DH
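The field extraction performed by these transfers can be sketched with simple shifts and masks. The layout used is the one given in the text (significand in the low 64 bits, exponent in the next 17, sign on top); the function names mirror the mnemonics but are only illustrative.

```python
MASK64 = (1 << 64) - 1


def setf_sig(value):
    """Place a 64-bit integer in the significand; exponent 1003EH, sign 0."""
    return (0x1003E << 64) | (value & MASK64)


def getf_sig(image):
    """Return the 64-bit significand of a register image."""
    return image & MASK64


def getf_exp(image):
    """Return the exponent in bits 0-16 and the sign in bit 17."""
    return (image >> 64) & 0x3FFFF


fr10 = setf_sig(125)
print(f"{fr10:021X}H")
print(f"{getf_sig(fr10):016X}H")
print(f"{getf_exp(fr10):05X}H")
```

Since the sign sits directly above the exponent in the register, a single shift and an 18-bit mask yield both fields in the arrangement the text describes for the exponent transfer.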
Integer/Floating-point Conversion Instructions
We just found that the getf.sig and setf.sig instructions convert the value of an integer between register format in a floating-point register and integer format in a general register. Here, we examine ways to convert a floating-point number in one floating-point register to either a signed or unsigned integer in another floating-point register and vice versa. This type of conversion operation is performed with the convert floating-point to integer (fcvt.fx/fcvt.fxu) and convert signed/unsigned integer to floating-point (fcvt.xf/fcvt.xuf) instructions.
Let us begin by looking at the format and operation of fcvt.fx and fcvt.fxu . Two formats are give instructions in Table 9.12 . The operation performed by the first form of the instruction is to read the flo source register f2; convert it to a signed integer using the rounding method selected by the rc code in the floating-point status register selected by .sf; and place the signed integer that results in the 64-bi destination register f1 . As part of this process, the exponent field of f1 is set to the biased exponent 1 sign field is set to positive (0). The following is an example of the instruction:
fcvt.fx.s0 f5=f10
The second format performs the same operation, but rounds differently. The .trunc completer sets th zero (truncate/chop) method. Example
If Fr10 contains the floating-point number 10005FAC0000000000000H (+125.375), what result is produ executing the following instruction? fcvt.fx.s0 f5=f10 Solution
Executing the instruction converts the floating-point number in Fr10 to an integer number in Fr5 . The flo rounded to 0. This gives:
Fr5 = 0100000000001111100000000000000000000000000000000000000000000000000000000 = 1003E000000000000007DH
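The two conversion forms can be compared with a short sketch. It assumes the rc field selects round-to-nearest-even for the plain form, stores negative results in two's complement, and raises an error on overflow as a simplification of the hardware behavior. The function name is hypothetical.

```python
import math

MASK64 = (1 << 64) - 1


def fcvt_fx(x, trunc=False):
    """Convert a float to a signed 64-bit integer held in register format."""
    n = math.trunc(x) if trunc else round(x)  # round(): nearest, ties to even
    if not -(1 << 63) <= n < (1 << 63):
        raise OverflowError("result does not fit in a signed 64-bit integer")
    return (0x1003E << 64) | (n & MASK64)     # exponent 1003EH, sign field 0


print(f"{fcvt_fx(125.375):021X}H")
print(f"{fcvt_fx(125.9):021X}H")
print(f"{fcvt_fx(125.9, trunc=True):021X}H")
```

For 125.375 both forms give the same integer; the difference only shows for values such as 125.9, where rounding to nearest and truncation disagree.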
The fcvt.fxu instruction formats in Table 9.12 perform the same floating-point to integer operation, b unsigned integer result.
The fcvt.xf/fcvt.xuf instructions convert a 64-bit signed or unsigned integer, respectively, in the source floating-point register to a floating-point number in the destination floating-point register. For example, the following instruction converts the unsigned integer number in Fr5 to a double-precision (.d) floating-point number in Fr10.
fcvt.xuf.d.s0 f10=f5
Rounding is performed according to the method selected by the rc code in the main status field (.s0)
The last form of the convert instruction in Table 9.12 is called convert parallel floating-point to integer (fpcvt.fx/fpcvt.fxu). When executed, fpcvt.fx converts a pair of single-precision floating-point numbers in the significand field of source register f2 to a pair of 32-bit signed integer numbers in destination register f1. Rounding is either defined by the rc code in the status field corresponding to the .sf completer or truncated to zero. The exponent field is set to the integer-biased exponent value (1003EH), and the sign field is set to 0 for positive.
FLOATING-POINT ARITHMETIC INSTRUCTIONS
In the last section, we showed how to transfer floating-point data between memory and the floating-point registers, and between the general registers and floating-point registers. Once information is held in the floating-point register file, it is ready to be processed with other floating-point operations that are in the Itanium processor's floating-point instruction set. In this section, the floating-point arithmetic instructions are introduced. These instructions can perform a wide variety of arithmetic operations on floating-point numbers. For example, instructions are provided to find the negative or absolute value of a number, compare two numbers to find the minimum or maximum, and to perform arithmetic operations such as addition, subtraction, or multiplication. A floating-point division instruction is not supported in the instruction set. This arithmetic function must be performed in software.
Arithmetic Computations Involving One Source Operand Register
For ease in understanding, the floating-point arithmetic instructions have been organized based on the number of source operands they employ in the computation. Table 9.15 summarizes the group of arithmetic instructions that involve one source operand. This table lists the instruction mnemonic, format, and a brief description of its operations. Note that most floating-point arithmetic instructions have two forms, one form for processing floating-point normal data and a second for processing floating-point parallel data. Since the operand combinations are identical for instructions that perform similar functions, we will just describe the operation of a typical instruction of each type.

Table 9.15: Floating-point Arithmetic Instructions that Employ One Source Operand
Mnemonic  Operation
    Format
        Description
fneg  Floating-point negate
    (qp) fneg f1=f3
        The value in f3 is negated and placed in f1.
    (qp) fpneg f1=f3
        The pair of single-precision values in the significand field of f3 are negated and stored in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fabs  Floating-point absolute value
    (qp) fabs f1=f3
        The absolute value of the value in f3 is computed and placed in f1.
    (qp) fpabs f1=f3
        The absolute values of the pair of single-precision values in the significand field of f3 are computed and stored in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fnegabs  Floating-point negate absolute value
    (qp) fnegabs f1=f3
        The absolute value of the value in f3 is computed, negated, and placed in f1.
    (qp) fpnegabs f1=f3
        The absolute values of the pair of single-precision values in the significand field of f3 are computed, negated, and stored in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fnorm  Floating-point normalize
    (qp) fnorm.pc.sf f1=f3
        The value in f3 is normalized and the rounded result is placed in f1.
Notice that the floating-point negate (fneg), floating-point absolute value (fabs), floating-point negate absolute value (fnegabs) and floating-point normalize (fnorm) instructions all use a common operand configuration. The instruction formats in Table 9.15 show that they each perform their defining arithmetic function on the number in a single source floating-point register and place the result in the destination floating-point register. For instance, executing the following instruction determines the absolute value of the floating-point number in Fr10 and places this result in Fr5.
(p0) fabs f5=f10
If the value in Fr10 is a NaTVal that resulted from a deferred speculation exception, Fr5 is made equal to the NaTVal (1FFFE0000000000000000H), instead of the computed absolute value.
Example
Assume that the floating-point number in Fr10 is +125.375. What result is found in Fr5 after executing the following instruction?
(p0) fneg f5=f10
Solution
The original value of the source operand is:
Fr10 = +125.375 = 10005FAC0000000000000H
When the instruction is executed, the value in Fr10 is converted to its negative and placed in Fr5, as follows:
Fr5 = -125.375 = 30005FAC0000000000000H
Many of the floating-point instructions are actually pseudo-operations. That is, the compiler may use them in an application program, but replaces them by another floating-point instruction. For example, the floating-point negate instruction from our Example above is actually implemented by a pseudo-op instruction and is automatically replaced by the instruction:
(p0) fmerge.ns f5=f10,f10
The parallel forms of these instructions, fpneg, fpabs, fpnegabs, and fpnorm perform the same arithmetic operation, but on the individual single-precision (32-bit) floating-point values in the significand field of the source operand. Remember that a parallel floating-point number is expressed in floating-point register format with 0 in the sign field and 1003EH in the exponent field. An example is the instruction:
(p0) fpnegabs f5=f10
This instruction independently forms the absolute value of each of the values in the significand of the parallel floating-point number in Fr10, negates these values, and places them in Fr5 using parallel floating-point register format.
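Because these one-operand instructions only manipulate the sign, their effect on a register image can be sketched as operations on bit 81. The NaTVal check follows the propagation rule described above; the function names are illustrative and the trailing underscore on fabs_ is just a way to keep the name distinct from other uses.

```python
SIGN_BIT = 1 << 81
NATVAL = 0x1FFFE << 64  # sign 0, exponent 1FFFEH, significand 0


def fneg(image):
    """Invert the sign bit; a NaTVal source is passed through unchanged."""
    return image if image == NATVAL else image ^ SIGN_BIT


def fabs_(image):
    """Clear the sign bit; a NaTVal source is passed through unchanged."""
    return image if image == NATVAL else image & ~SIGN_BIT


def fnegabs(image):
    """Set the sign bit; a NaTVal source is passed through unchanged."""
    return image if image == NATVAL else image | SIGN_BIT


fr10 = int("10005" + "FAC" + "0" * 13, 16)  # +125.375
print(f"{fneg(fr10):021X}H")
print(f"{fabs_(fneg(fr10)):021X}H")
print(f"{fnegabs(fr10):021X}H")
print(f"{fneg(NATVAL):021X}H")
```

The exponent and significand fields are untouched in every case, which is why these operations need no rounding.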
Arithmetic Computations Involving Two Source Operand Registers
A second group of instructions performs an arithmetic operation on the floating-point values in two source operands and places the result in a destination operand. They are listed in Table 9.16 and include the floating-point maximum and minimum (fmax, fmin), floating-point absolute maximum and minimum (famax, famin), and floating-point add, subtract, multiply, and negative multiply (fadd, fsub, fmpy, and fnmpy) instructions. Table 9.16 shows the general format of the floating-point addition instruction, as follows:
(qp) fadd.pc.sf f1=f3,f2

Table 9.16: Floating-point Arithmetic Instructions that Employ Two Source Operands
Mnemonic  Operation
    Format
        Description
fmax  Floating-point maximum
    (qp) fmax.sf f1=f2,f3
        The operand with the larger value is placed in f1.
    (qp) fpmax.sf f1=f2,f3
        The paired single-precision values in the significands of f2 and f3 are compared. The operand with the larger value is returned in the significand of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fmin  Floating-point minimum
    (qp) fmin.sf f1=f2,f3
        The operand with the smaller value is placed in f1.
    (qp) fpmin.sf f1=f2,f3
        The paired single-precision values in the significands of f2 and f3 are compared. The operand with the smaller value is returned in the significand of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
famax  Floating-point absolute maximum
    (qp) famax.sf f1=f2,f3
        The operand with the larger absolute value is placed in f1.
    (qp) fpamax.sf f1=f2,f3
        The paired single-precision values in the significands of f2 and f3 are compared. The operand with the larger absolute value is returned in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
famin  Floating-point absolute minimum
    (qp) famin.sf f1=f2,f3
        The operand with the smaller absolute value is placed in f1.
    (qp) fpamin.sf f1=f2,f3
        The paired single-precision values in the significands of f2 and f3 are compared. The operand with the smaller absolute value is returned in the significand of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fadd  Floating-point add
    (qp) fadd.pc.sf f1=f3,f2
        The values in f3 and f2 are added and their sum is placed in f1.
fsub  Floating-point subtract
    (qp) fsub.pc.sf f1=f3,f2
        The value in f2 is subtracted from that in f3 and the difference is placed in f1.
fmpy  Floating-point multiply
    (qp) fmpy.pc.sf f1=f3,f4
        The values in f3 and f4 are multiplied and their product is placed in f1.
    (qp) fpmpy.sf f1=f3,f4
        The pairs of single-precision values in the significand fields of f3 and f4 are multiplied. The resulting values are then rounded to single precision and the pair of rounded results is stored in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
fnmpy  Floating-point negative multiply
    (qp) fnmpy.pc.sf f1=f3,f4
        The values in f3 and f4 are multiplied and the product is negated. The rounded result is placed in f1.
    (qp) fpnmpy.sf f1=f3,f4
        The pairs of single-precision values in the significand fields of f3 and f4 are multiplied and then the products are negated. The resulting values are then rounded to single precision and the pair of rounded results is stored in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1.
Notice that the instruction mnemonic is appended with two completers: precision control (.pc) and status field (.sf). These completers permit the precision of the data in the operands f1, f2, and f3 to be specified in either of two ways. First, Table 9.17 shows that appending .s or .d as the precision control completer says that the operands are single precision or double precision, respectively. On the other hand, if no .pc completer is used, the .sf completer determines the precision. The allowed values for the status field completer are given in Table 9.9. In this case, the precision control (pc) field of the selected status register in the FPSR defines the precision of the operands for the floating-point addition. Remember that the status field specifies the following: enabled exceptions, rounding mode, exponent width, precision control, and the updates for particular status field flags. If neither a .pc nor .sf completer is appended to the add mnemonic, the values of pc and wre in the main status field determine the precision of the operands.

Table 9.17: Precision Control Completer
pc      Precision specified
.s      Single
.d      Double
None    Dynamic (use pc value in status field of FPSR)
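The effect of the precision control completer on an addition can be illustrated with exact rational arithmetic. The sketch below assumes round-to-nearest-even, ignores exponent range limits, and uses the standard IEEE significand widths of 24 bits for single and 53 bits for double; the function and table names are hypothetical.

```python
from fractions import Fraction

SIGNIFICAND_BITS = {".s": 24, ".d": 53}


def round_to_precision(value, bits):
    """Round an exact rational to `bits` significant bits, ties to even."""
    if value == 0:
        return Fraction(0)
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    exp = 0
    while magnitude >= 2 ** bits:        # scale down into [2**(bits-1), 2**bits)
        magnitude /= 2
        exp += 1
    while magnitude < 2 ** (bits - 1):   # or scale up into the same interval
        magnitude *= 2
        exp -= 1
    return sign * round(magnitude) * Fraction(2) ** exp


def fadd(a, b, pc):
    """Add two floats exactly, then round once to the selected precision."""
    return round_to_precision(Fraction(a) + Fraction(b), SIGNIFICAND_BITS[pc])


a, b = 1.0, 2.0 ** -30
print(float(fadd(a, b, ".d")))  # the small addend survives in double precision
print(float(fadd(a, b, ".s")))  # it is rounded away in single precision
```

The same pair of operands therefore produces different sums depending only on the completer, which is the behavior the table summarizes.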
As an example, let us examine the operation performed by the following instruction: (p0) fadd.d f10=f5,f6 When this instruction is executed, the floating-point numbers in Fr5 and Fr6 are added, then rounded to double precision using the rounding mode specified by the rc field of the main status field in the FPSR. The resulting double-precision sum is placed in Fr10. If the value in either Fr5 or Fr6 is NaTVal, the value in that register was the result of a speculative operation that resulted in a deferred exception. The deferred exception is propagated by the add operation. That is, NaTVal is written into Fr10 instead of the computed sum. Example If the contents of Fr5 and Fr6 are 1003E0000007C0000007DH and 1003E0000007B0000007EH, respectively, what is the result of executing the following instruction?
(p0) fpmax.s0 f10=f5,f6 Solution First, we will extract the upper and lower 32-bit integers of the parallel data in Fr5 and Fr6. This gives: Fr5U = 0000007CH = 124 Fr5L = 0000007DH = 125 Fr6U = 0000007BH = 123 Fr6L = 0000007EH = 126 Execution of the instruction causes the upper and lower pairs of numbers to be independently compared to determine which is the larger (maximum) number. Comparing Fr5U = 124 and Fr6U = 123, we see that the value in Fr5U is larger. Therefore, Fr10U is made equal to 124. Since the value in Fr6L (126) is larger than that in Fr5L (125), Fr10L equals 126. This gives the result: Fr10 = 1003E0000007C0000007EH
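The lane-by-lane comparison in this example can be reproduced with a short sketch. Each 32-bit half is interpreted as an IEEE single-precision value before comparing; for the small positive patterns used here this gives the same ordering as comparing the raw integers. NaN handling is ignored, and the function names are hypothetical.

```python
import struct


def lanes(image):
    """Split the significand into its upper and lower 32-bit halves."""
    significand = image & ((1 << 64) - 1)
    return significand >> 32, significand & 0xFFFFFFFF


def as_single(bits):
    """Interpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fpmax(a, b):
    """Return the parallel FP image holding the larger value of each lane."""
    significand = 0
    for lane_a, lane_b in zip(lanes(a), lanes(b)):
        winner = lane_a if as_single(lane_a) > as_single(lane_b) else lane_b
        significand = (significand << 32) | winner
    return (0x1003E << 64) | significand


fr5 = int("1003E" + "0000007C" + "0000007D", 16)
fr6 = int("1003E" + "0000007B" + "0000007E", 16)
print(f"{fpmax(fr5, fr6):021X}H")
```

The upper lane is taken from the first operand and the lower lane from the second, matching the result worked out by hand.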
Arithmetic Computations Involving Three Source Operand Registers

Table 9.18 lists another group of the floating-point arithmetic instructions. These instructions employ three source operands: floating-point multiply add (fma, fpma), floating-point multiply subtract (fms, fpms), and floating-point negative multiply add (fnma, fpnma). For instance, the general format of the floating-point multiply subtract instruction is:

(qp) fms.pc.sf f1=f3,f4,f2

Table 9.18: Floating-point Arithmetic Instructions that Employ Three Source Operands

Mnemonic: fma
Operation: Floating-point multiply add
Format: (qp) fma.pc.sf f1=f3,f4,f2
Description: The values in f3 and f4 are multiplied and then the value in f2 is added to the product. The rounded result is placed in f1.
Format: (qp) fpma.sf f1=f3,f4,f2
Description: The pairs of single-precision values in the significand fields of f3 and f4 are multiplied and then the pair of single-precision values in the significand field of f2 is added to these products. The resulting values are then rounded to single precision and the pair of rounded results is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: fms
Operation: Floating-point multiply subtract
Format: (qp) fms.pc.sf f1=f3,f4,f2
Description: The values in f3 and f4 are multiplied and then the value in f2 is subtracted from the product. The rounded result is placed in f1.
Format: (qp) fpms.sf f1=f3,f4,f2
Description: The pairs of single-precision values in the significand fields of f3 and f4 are multiplied and then the pair of single-precision values in the significand field of f2 is subtracted from these products. The resulting values are then rounded to single precision and the pair of rounded results is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: fnma
Operation: Floating-point negative multiply add
Format: (qp) fnma.pc.sf f1=f3,f4,f2
Description: The values in f3 and f4 are multiplied, the product is negated, and then the value in f2 is added to the negated product. The rounded result is placed in f1.
Format: (qp) fpnma.sf f1=f3,f4,f2
Description: The pairs of single-precision values in the significand fields of f3 and f4 are multiplied, negated, and then the pair of single-precision values in the significand field of f2 is added to these negated products. The resulting values are then rounded to single precision and the pair of rounded results is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

The arithmetic operation performed by the fms instruction is given by the expression:

(f3 x f4) - f2 → f1

Again, a .pc or .sf completer is used to specify the precision of the computation and the rounding method. If any of the source operands involved in the multiply-subtract computation contains NaTVal, the result is not used to update destination f1. Instead, f1 is loaded with NaTVal to propagate the deferred exception.

The operations performed by the fma and fms instructions are known as fused multiply-add and fused multiply-subtract. Since the multiply and add or multiply and subtract function is the core of many repetitive floating-point computations, combining these operations into a single instruction improves performance and reduces rounding error compared with an implementation that uses independent multiply and add or subtract instructions.

Earlier, we pointed out that many of the floating-point operations are actually pseudo-operations. For example, both the fadd and fsub instructions are actually pseudo-ops. These floating-point operations are translated to the fma and fms instructions, respectively, as follows:

(qp) fadd.pc.sf f1=f3,f2 → (qp) fma.pc.sf f1=f3,Fr1,f2
(qp) fsub.pc.sf f1=f3,f2 → (qp) fms.pc.sf f1=f3,Fr1,f2

Remember that the value in Fr1 is fixed at +1.0, which equals 0FFFF8000000000000000H in floating-point register notation. Therefore, the floating-point addition is performed with the operation:

(f3 x +1.0) + f2 → f1
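The effect of rounding once instead of twice can be seen in a small Python sketch. This is an illustration under stated assumptions, not the processor's implementation: the helpers fma, fms, fadd, and fsub are hypothetical names, the fused operation is emulated by computing the exact rational result and rounding it once, and the working precision is Python's double precision rather than the 82-bit register format.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fms(a, b, c):
    """Emulated fused multiply-subtract: exact a*b - c, rounded once."""
    return float(Fraction(a) * Fraction(b) - Fraction(c))

def fadd(x, y):
    # Mirrors the pseudo-op expansion: the multiplier is the constant +1.0.
    return fma(x, 1.0, y)

def fsub(x, y):
    return fms(x, 1.0, y)

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply and add: the product 1 - 2**-60 is rounded to 1.0
# before the add, so the small term is lost.
separate = a * b + c

# Fused: the exact product is kept until the single final rounding.
fused = fma(a, b, c)

print(separate)  # 0.0
print(fused)     # -2**-60, about -8.67e-19
```

In this sketch the separately rounded version loses the entire result, while the fused version returns the exact answer.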
Division and Square Root Computations

The Itanium architecture specifies two approximation instructions, frcpa and frsqrta, that support efficient and IEEE-correct software implementations of division, square root, and remainder computations. Deferring most of the division and square root computations to software offers several advantages. Most importantly, since each operation is broken down into several simpler instructions, these individual instructions can be scheduled more flexibly in conjunction with the rest of the code, increasing the potential for parallelism. In particular:

- Since the underlying operations are fully pipelined, the division and square root operations inherit the pipelining, allowing high throughput.
- If a perfectly rounded IEEE-correct result is not required (in graphics applications, for example), faster algorithms can be substituted.

Intel provides a number of recommended division and square root algorithms, in the form of short sequences of straight-line code written in assembly language for the Itanium architecture. These code sequences can be inlined by compilers, used as the core of mathematical libraries, or called on as macros by assembly language programmers. These algorithms are anticipated to serve all the needs of typical users, who when using a high-level language may be unaware of how division and square root are actually implemented.

All the algorithms provided by Intel have been carefully designed to provide IEEE-correct results and trigger IEEE flags and exceptions appropriately. Subject to this correctness constraint, they have been written to maximize performance on the Itanium processor, the first silicon implementation of the Itanium architecture. However, they are likely to be the most appropriate algorithms for future processors, even those with significantly different hardware characteristics.
The operation of this last group of floating-point instructions is summarized in Table 9.19. The arithmetic computations performed by these instructions, floating-point reciprocal approximation (frcpa, fprcpa) and floating-point reciprocal square-root approximation (frsqrta, fprsqrta), are actually approximations. These instructions are typically used to obtain initial approximations used in efficient and IEEE-correct software implementations of division, square root, and remainder computations.

Table 9.19: Floating-point Arithmetic Instructions that Perform Approximations

Mnemonic: frcpa
Operation: Floating-point reciprocal approximation
Format: (qp) frcpa.sf f1,p2=f2,f3
Description: If qp=0, p2 → 0; f1 remains unchanged. If qp=1, p2 → 1; f1 is set to an approximation of the reciprocal of the value in f3.
Format: (qp) fprcpa.sf f1,p2=f2,f3
Description: If qp=0, p2 → 0; f1 remains unchanged. If qp=1, p2 → 1; each half of the significand of f1 is set to an approximation of the reciprocal of the value in the corresponding half of f3.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: frsqrta
Operation: Floating-point reciprocal square root approximation
Format: (qp) frsqrta.sf f1,p2=f3
Description: If qp=0, p2 → 0; f1 remains unchanged. If qp=1, p2 → 1; f1 is set to an approximation of the reciprocal square root of the value in f3.
Format: (qp) fprsqrta.sf f1,p2=f3
Description: If qp=0, p2 → 0; f1 remains unchanged. If qp=1, p2 → 1; each half of the significand of f1 is set to an approximation of the reciprocal square root of the value in the corresponding half of f3.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Here's an example:

(p0) frcpa.s1 f10,p5=f5,f6

Executing the instruction loads Fr10 with an approximation of the reciprocal calculation that follows:

Fr10 = 1/Fr6

Then, predicate bit Pr5 is set to 1. If the value in either Fr5 or Fr6 is NaTVal due to a failed speculative load, the result produced in Fr10 is NaTVal instead of the computed result, and Pr5 is set to 0.

The results in Table 9.19 for the approximation instructions are those produced when the reciprocal calculation is made using valid normalized numbers. If the value in either f2 or f3 is equal to ± infinity, ± zero, pseudo-zero, a NaN, or an unsupported data type, then the destination f1 is filled with the IEEE-754 mandated quotient f2/f3, and p2 is set to 0. If the values in f2 and f3 are such that the approximation of the reciprocal of the value in f3 may cause division or square root sequences to fail to produce the correct IEEE-754 results, then a floating-point exception fault for software assist occurs. The operating system must intervene to calculate the IEEE-754 result, place this result in f1, and set p2 to 0.
Single-precision Floating-point Divide, Latency Optimized

The Intel® algorithms and assembly code are provided, at no cost, on-line as “Divide, Square Root, and Remainder Algorithms for the IA-64 Architecture” at http://developer.intel.com/software/opensource/numerics/. We will provide one divide example here in this book. This example is taken directly from the Intel information on-line. The example computes a full IEEE single-precision divide with minimal latency. An algorithm with higher throughput is possible, and that algorithm should be used if throughput is more important, which is often the case when a divide operation is used inside a loop. Details of a throughput-optimized version are given in the Intel on-line information cited previously, as are algorithms for double precision and a full set of square root algorithms.
Since a single hardware implementation of division would force a choice between being optimized for latency or throughput, we can see that the Itanium architecture gives us flexibility to tailor our divide to our needs. We can even choose to optimize for many applications, such as graphics, which may not need the full IEEE result, giving even higher throughput. While using the approximation methods makes coding a little more challenging, the payoff in performance is significant since we can tailor the method to the needs of our program.

Example

The following algorithm calculates q’3 = a/b in single precision, where a and b are single precision numbers. The rn subscript denotes the IEEE round to nearest mode, and rnd denotes any IEEE rounding mode. All other symbols used are 82-bit register format numbers. The precision for each step is shown below.

1. y0 = (1/b) · (1 + e0), |e0| < 2^-8.886     table lookup (frcpa)
2. q0 = (a · y0)rn                            82-bit register format precision
3. e0 = (1 - b · y0)rn                        82-bit register format precision
4. q1 = (q0 + e0 · q0)rn                      82-bit register format precision
5. e1 = (e0 · e0)rn                           82-bit register format precision
6. q2 = (q1 + e1 · q1)rn                      82-bit register format precision
7. e2 = (e1 · e1)rn                           82-bit register format precision
8. q3 = (q2 + e2 · q2)rn                      17-bit exponent, 53-bit mantissa
9. q’3 = (q3)rnd                              single precision
Solution

The assembly language implementation is:

        .file "sgl_div_min_lat.s"
        .section .text
        .proc sgl_div_min_lat#
        .align 32
        .global sgl_div_min_lat#
        .align 32
sgl_div_min_lat:
{ .mmi
        alloc r31=ar.pfs,3,0,0,0        // r32, r33, r34
        // &a is in r32
        // &b is in r33
        // &div is in r34 (the address of the divide result)
        // general registers used: r31, r32, r33, r34
        // predicate registers used: p6
        // floating-point registers used: f6, f7, f8
        nop.m 0
        nop.i 0;;
} { .mmi
        // load a, the first argument, in f6
        ldfs f6 = [r32]
        // load b, the second argument, in f7
        ldfs f7 = [r33]
        nop.i 0;;
} { .mfi
        // BEGIN SINGLE PRECISION MINIMUM LATENCY DIVIDE ALGORITHM
        nop.m 0
        // Step (1)
        // y0 = 1 / b in f8
        frcpa.s0 f8,p6=f6,f7
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (2)
        // q0 = a * y0 in f6
        (p6) fma.s1 f6=f6,f8,f0
        nop.i 0
} { .mfi
        nop.m 0
        // Step (3)
        // e0 = 1 - b * y0 in f7
        (p6) fnma.s1 f7=f7,f8,f1
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (4)
        // q1 = q0 + e0 * q0 in f6
        (p6) fma.s1 f6=f7,f6,f6
        nop.i 0
} { .mfi
        nop.m 0
        // Step (5)
        // e1 = e0 * e0 in f7
        (p6) fma.s1 f7=f7,f7,f0
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (6)
        // q2 = q1 + e1 * q1 in f6
        (p6) fma.s1 f6=f7,f6,f6
        nop.i 0
} { .mfi
        nop.m 0
        // Step (7)
        // e2 = e1 * e1 in f7
        (p6) fma.s1 f7=f7,f7,f0
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (8)
        // q3 = q2 + e2 * q2 in f6
        (p6) fma.d.s1 f6=f7,f6,f6
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (9)
        // q3' = q3 in f8
        (p6) fma.s.s0 f8=f6,f1,f0
        nop.i 0;;
        // END SINGLE PRECISION MINIMUM LATENCY DIVIDE ALGORITHM
} { .mmb
        nop.m 0
        // store result
        stfs [r34]=f8
        // return
        br.ret.sptk b0;;
}
        .endp sgl_div_min_lat
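To see why three refinement steps are enough, the nine-step divide sequence can be traced in a short Python sketch. It rests on several assumptions that are not part of the Intel code: reciprocal_seed is a hypothetical stand-in for the frcpa table lookup (it simply rounds 1/b to about 11 significant bits), Python doubles replace the 82-bit register format, the multiply-adds are not fused, and only positive normal inputs are considered. The sketch therefore shows the structure of the iteration and makes no claim of IEEE-correct rounding for every input.

```python
import math
import struct

def reciprocal_seed(b, bits=11):
    """Hypothetical stand-in for frcpa: 1/b rounded to about `bits` bits."""
    m, ex = math.frexp(1.0 / b)
    return math.ldexp(round(m * (1 << bits)) / (1 << bits), ex)

def to_single(x):
    """Round a double to single precision (round to nearest)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def divide_min_latency(a, b):
    y0 = reciprocal_seed(b)   # step 1: initial approximation of 1/b
    q0 = a * y0               # step 2: first quotient estimate
    e0 = 1.0 - b * y0         # step 3: relative error of y0
    q1 = q0 + e0 * q0         # step 4: error drops to about e0**2
    e1 = e0 * e0              # step 5
    q2 = q1 + e1 * q1         # step 6: error drops to about e0**4
    e2 = e1 * e1              # step 7
    q3 = q2 + e2 * q2         # step 8: error drops to about e0**8
    return to_single(q3)      # step 9: final rounding to single precision

print(divide_min_latency(1.0, 3.0))
print(divide_min_latency(10.0, 4.0))  # 2.5
```

Each correction multiplies the quotient by (1 + e), so the remaining relative error is squared at every step; an initial error near 2^-9 falls below single precision after the second correction, and the third gives the extra margin needed for the final rounding.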
Double-precision Floating-point Square Root, Latency Optimized
The Intel algorithms and assembly code are provided, at no cost, on-line as “Divide, Square Root, and Remainder Algorithms for the IA-64 Architecture” at http://developer.intel.com/software/opensource/numerics/. We will provide one square root example here that is taken directly from the Intel information on-line. The example computes a full IEEE double precision square root with minimal latency. An algorithm with higher throughput is possible, and that algorithm should be used when throughput is more important, as is often the case when a square root operation is used inside a loop. Details of a throughput-optimized version are given in the Intel on-line reference cited previously, as are the double precision versions of the algorithms. Since a single hardware implementation of square root would force a choice between being optimized for latency or throughput, we can see that the Itanium architecture gives us flexibility to tailor our square root to our needs.

Example

The following algorithm calculates S = √a in double precision, where a is a double precision number. The rn subscript denotes the IEEE round to nearest mode, and rnd denotes any IEEE rounding mode. All other symbols used are 82-bit register format numbers. The precision used for each step is shown below.

1. y0 = (1/√a) · (1 + e0), |e0| < 2^-8.831    table lookup (frsqrta)
2. h = (0.5 · y0)rn                           82-bit register format precision
3. g = (a · y0)rn                             82-bit register format precision
4. e = (0.5 - g · h)rn                        82-bit register format precision
5. S0 = (1.5 + 2.5 · e)rn                     82-bit register format precision
6. e2 = (e · e)rn                             82-bit register format precision
7. t = (63/8 + 231/16 · e)rn                  82-bit register format precision
8. S1 = (e + e2 · S0)rn                       82-bit register format precision
9. e4 = (e2 · e2)rn                           82-bit register format precision
10. t1 = (35/8 + e · t)rn                     82-bit register format precision
11. G = (g + S1 · g)rn                        82-bit register format precision
12. E = (g · e4)rn                            82-bit register format precision
13. u = (S1 + e4 · t1)rn                      82-bit register format precision
14. g1 = (G + t1 · E)rn                       17-bit exponent, 53-bit mantissa
15. h1 = (h + u · h)rn                        82-bit register format precision
16. d = (a - g1 · g1)rn                       82-bit register format precision
17. S = (g1 + d · h1)rnd                      double precision
Solution

The assembly language implementation is:

        .file "dbl_sqrt_min_lat.s"
        .section .text
        .proc dbl_sqrt_min_lat#
        .align 32
        .global dbl_sqrt_min_lat#
        .align 32
dbl_sqrt_min_lat:
{ .mmb
        alloc r31=ar.pfs,2,0,0,0        // r32, r33
        // &a is in r32
        // &sqrt is in r33 (the address of the sqrt result)
        // general registers used: r2, r3, r31, r32, r33
        // predicate registers used: p6
        // floating-point registers used: f6 to f14
        // load the argument a in f6
        ldfd f6 = [r32]
        nop.b 0
} { .mlx
        // BEGIN DOUBLE PRECISION MINIMUM LATENCY SQUARE ROOT ALGORITHM
        nop.m 0
        // exponent of +1/2 in r2
        movl r2 = 0x0fffe;;
} { .mfi
        // +1/2 in f9
        setf.exp f9 = r2
        nop.f 0
        nop.i 0
} { .mlx
        nop.m 0
        // 3/2 in r3
        movl r3=0x3fc00000;;
} { .mfi
        setf.s f10=r3
        // Step (1)
        // y0 = 1/sqrt(a) in f7
        frsqrta.s0 f7,p6=f6
        nop.i 0;;
} { .mlx
        nop.m 0
        // 5/2 in r2
        movl r2 = 0x40200000
} { .mlx
        nop.m 0
        // 63/8 in r3
        movl r3 = 0x40fc0000;;
} { .mfi
        setf.s f11=r2
        // Step (2)
        // h = +1/2 * y0 in f8
        (p6) fma.s1 f8=f9,f7,f0
        nop.i 0
} { .mfi
        setf.s f12=r3
        // Step (3)
        // g = a * y0 in f7
        (p6) fma.s1 f7=f6,f7,f0
        nop.i 0;;
} { .mlx
        nop.m 0
        // 231/16 in r2
        movl r2 = 0x41670000;;
} { .mfi
        setf.s f13=r2
        // Step (4)
        // e = 1/2 - g * h in f9
        (p6) fnma.s1 f9=f7,f8,f9
        nop.i 0
} { .mlx
        nop.m 0
        // 35/8 in r3
        movl r3 = 0x408c0000;;
} { .mfi
        setf.s f14=r3
        // Step (5)
        // S0 = 3/2 + 5/2 * e in f10
        (p6) fma.s1 f10=f11,f9,f10
        nop.i 0
} { .mfi
        nop.m 0
        // Step (6)
        // e2 = e * e in f11
        (p6) fma.s1 f11=f9,f9,f0
        nop.i 0
} { .mfi
        nop.m 0
        // Step (7)
        // t = 63/8 + 231/16 * e in f12
        (p6) fma.s1 f12=f13,f9,f12
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (8)
        // S1 = e + e2 * S0 in f10
        (p6) fma.s1 f10=f11,f10,f9
        nop.i 0
} { .mfi
        nop.m 0
        // Step (9)
        // e4 = e2 * e2 in f11
        (p6) fma.s1 f11=f11,f11,f0
        nop.i 0
} { .mfi
        nop.m 0
        // Step (10)
        // t1 = 35/8 + e * t in f9
        (p6) fma.s1 f9=f9,f12,f14
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (11)
        // G = g + S1 * g in f12
        (p6) fma.s1 f12=f10,f7,f7
        nop.i 0
} { .mfi
        nop.m 0
        // Step (12)
        // E = g * e4 in f7
        (p6) fma.s1 f7=f7,f11,f0
        nop.i 0
} { .mfi
        nop.m 0
        // Step (13)
        // u = S1 + e4 * t1 in f10
        (p6) fma.s1 f10=f11,f9,f10
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (14)
        // g1 = G + t1 * E in f7
        (p6) fma.d.s1 f7=f9,f7,f12
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (15)
        // h1 = h + u * h in f8
        (p6) fma.s1 f8=f10,f8,f8
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (16)
        // d = a - g1 * g1 in f9
        (p6) fnma.s1 f9=f7,f7,f6
        nop.i 0;;
} { .mfi
        nop.m 0
        // Step (17)
        // S = g1 + d * h1 in f7
        (p6) fma.d.s0 f7=f9,f8,f7
        nop.i 0;;
        // END DOUBLE PRECISION MINIMUM LATENCY SQUARE ROOT ALGORITHM
} { .mmb
        nop.m 0
        // store result
        stfd [r33]=f7
        // return
        br.ret.sptk b0;;
}
        .endp dbl_sqrt_min_lat
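The 17 steps of the square root sequence can also be traced numerically. The Python sketch below is an illustration only and makes several assumptions: rsqrt_seed is a hypothetical stand-in for the frsqrta table lookup (it rounds 1/sqrt(a) to about 11 significant bits), fma emulates a fused multiply-add by rounding the exact rational result once, and every step is rounded to double precision rather than to the 82-bit register format. It should therefore be read as a model of the data flow, without any guarantee of a correctly rounded result for every input.

```python
import math
from fractions import Fraction

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def rsqrt_seed(a, bits=11):
    """Hypothetical stand-in for frsqrta: 1/sqrt(a) to about `bits` bits."""
    m, ex = math.frexp(1.0 / math.sqrt(a))
    return math.ldexp(round(m * (1 << bits)) / (1 << bits), ex)

def sqrt_min_latency(a):
    y0 = rsqrt_seed(a)               # step 1
    h = fma(0.5, y0, 0.0)            # step 2: h  = 0.5 * y0
    g = fma(a, y0, 0.0)              # step 3: g  = a * y0, roughly sqrt(a)
    e = fma(-g, h, 0.5)              # step 4: e  = 0.5 - g * h
    s0 = fma(2.5, e, 1.5)            # step 5: S0 = 1.5 + 2.5 * e
    e2 = fma(e, e, 0.0)              # step 6: e2 = e * e
    t = fma(231 / 16, e, 63 / 8)     # step 7: t  = 63/8 + 231/16 * e
    s1 = fma(e2, s0, e)              # step 8: S1 = e + e2 * S0
    e4 = fma(e2, e2, 0.0)            # step 9: e4 = e2 * e2
    t1 = fma(e, t, 35 / 8)           # step 10: t1 = 35/8 + e * t
    big_g = fma(s1, g, g)            # step 11: G  = g + S1 * g
    big_e = fma(g, e4, 0.0)          # step 12: E  = g * e4
    u = fma(e4, t1, s1)              # step 13: u  = S1 + e4 * t1
    g1 = fma(t1, big_e, big_g)       # step 14: g1 = G + t1 * E
    h1 = fma(u, h, h)                # step 15: h1 = h + u * h
    d = fma(-g1, g1, a)              # step 16: d  = a - g1 * g1
    return fma(d, h1, g1)            # step 17: S  = g1 + d * h1

print(sqrt_min_latency(2.0))
print(sqrt_min_latency(4.0))  # 2.0
```

The constants 1.5, 2.5, 35/8, 63/8, and 231/16 are the leading coefficients of the series for 1/sqrt(1 - 2e), which is why steps 5 through 14 amount to evaluating a polynomial correction to g; steps 15 through 17 then apply one final correction using the residual d.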
Optimizing a Floating-point Computation Using the Fused Multiply-add Operation

Many applications are characterized by repeating floating-point computations. For this reason, the latency of floating-point operations can have a significant impact on program performance. The FORTRAN routine in Figure 9.11(a) represents a recurrence computation. When executed, this routine computes a series of 10 values of B5 and saves them in memory. A computation involves a recurrence relationship if the result to be computed depends on the value of a prior result. Notice that the new value of B5 depends on its previous value. The performance of this computation is impacted by the fact that the value of B5(k+KB5I) is dependent on the value of STB5. The dependency between these floating-point numbers may limit parallelism in the solution implemented by the compiler.
      DO 191 k= 1,10
        B5(k+KB5I)= SA(k) + STB5 * SB(k)
        STB5= B5(k+KB5I) - STB5
191   CONTINUE

(a) FORTRAN recurrence computation routine

L1:   ldf f32=[r5],8          // Load SA(k)
      ldf f42=[r6],8 ;;       // Load SB(k)
      fmul f5=f7,f42 ;;       // Temp result
      fadd f6=f32,f5 ;;       // Compute B5
      stf [r7]=f6,8           // Store B5
      fsub f7=f6,f7           // Compute STB5
      br.ctop L1              // Repeat

(b) Implementation with separate multiply and add operations

L1:   ldf f32=[r5],8          // Load SA(k)
      ldf f42=[r6],8 ;;       // Load SB(k)
      fma f6=f7,f42,f32 ;;    // Compute B5
      stf [r7]=f6,8           // Store B5
      fsub f7=f6,f7           // Compute STB5
      br.ctop L1              // Repeat
(c) Improved implementation using fused multiply-add instruction

Figure 9.11: Optimizing a Recurrence Floating-point Computation

Figure 9.11(b) shows a non-optimized implementation of this software routine. Notice that independent floating-point multiply and add instructions are used to implement the computation of the value of B5 in floating-point register Fr6. The dependency between this computation and the temporary result in floating-point register Fr5 means that there must be an instruction group boundary between the fmul and fadd instructions.

Since a multiply followed by an add operation is the basis of many floating-point computations, a special fused multiply-add (fma) instruction is defined in the instruction set. Figure 9.11(c) shows how the compiler improves the implementation of the FORTRAN routine using this instruction. Use of fma increases parallelism by eliminating the dependency and instruction group boundary. Also, performance is increased because the execution latency of this single instruction is less than the total combined latency for the individual multiply and add instructions.

Using the fma instruction also improves accuracy. When independent multiply and add instructions are used, both instructions contribute a rounding error to the floating-point computations. The repeated calculation further compounds this error. The fused multiply-add operation offers the advantage of a single rounding error for the pair of calculations, giving more accurate results. The compiler can perform other optimizations, such as pipelining and loop unrolling, to improve the performance of this floating-point routine.
INTEGER MULTIPLICATION

During our examination of the integer arithmetic in Chapter 6, we indicated that the floating-point unit performs the integer multiplication operation. Integer multiplication is performed on data resident in the floating-point registers with an instruction known as the fixed-point multiply (xmpy) instruction. Just like floating-point division, the integer division operation must be performed in software.

Table 9.20 shows the formats of the xmpy instruction. This instruction treats the values in source operand registers f3 and f4 as either 64-bit signed or unsigned integers. That is, the sign and exponent fields of the data in the floating-point registers are ignored. These 64-bit numbers are multiplied to produce a 128-bit product; however, only the upper or lower 64 bits of the result are placed in destination operand register f1. For example, the following instruction multiplies the unsigned 64-bit numbers in Fr5 and Fr6 and places the upper 64 bits of this product in Fr10.

(p0) xmpy.hu f10=f5,f6

This result is expressed as an integer in floating-point register format. Therefore, the exponent field of Fr10 is set to the integer-biased exponent value 1003EH, and its sign field is made 0 for positive.

Table 9.20: Integer Multiply Instruction Formats

Mnemonic   Operation                   Format
xmpy       Fixed-point multiply        (qp) xmpy.l f1=f3,f4
                                       (qp) xmpy.lu f1=f3,f4
                                       (qp) xmpy.h f1=f3,f4
                                       (qp) xmpy.hu f1=f3,f4
xma        Fixed-point multiply add    (qp) xma.l f1=f3,f4,f2
                                       (qp) xma.lu f1=f3,f4,f2
                                       (qp) xma.h f1=f3,f4,f2
                                       (qp) xma.hu f1=f3,f4,f2
A second type of multiply instruction is provided in the instruction set. The format of this instruction, fixed-point multiply add (xma), is also given in Table 9.20. The following instruction is an example:

(p0) xma.lu f10=f5,f6,f7

When executed, it performs the 64-bit unsigned integer computation:

(Fr5 x Fr6) + Fr7 → Fr10

The value in Fr7 is zero-extended to 128 bits before it is added to the product. Since the .lu completer is appended to the mnemonic, just the lower 64 bits of the unsigned 128-bit result are written in integer floating-point register notation into Fr10. For both instructions, if the value in any of the source operands is NaTVal, the destination operand is made NaTVal instead of the computed result.
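The selection of the upper or lower half of the 128-bit product can be modeled with Python's arbitrary-precision integers. This is a minimal sketch with hypothetical function names; it operates on the 64-bit significand values only and leaves out the exponent and sign fields and NaTVal propagation.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret a 64-bit pattern as a signed two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def xmpy_l(a, b):
    """Lower 64 bits of the product (same bits for signed and unsigned)."""
    return (a * b) & MASK64

def xmpy_hu(a, b):
    """Upper 64 bits of the unsigned 128-bit product."""
    return ((a & MASK64) * (b & MASK64)) >> 64

def xmpy_h(a, b):
    """Upper 64 bits of the signed 128-bit product, as a 64-bit pattern."""
    return ((to_signed64(a) * to_signed64(b)) >> 64) & MASK64

def xma_lu(a, b, c):
    """Lower 64 bits of (a * b) + c, with c zero-extended to 128 bits."""
    return ((a & MASK64) * (b & MASK64) + (c & MASK64)) & MASK64

print(hex(xmpy_hu(1 << 63, 4)))   # 0x2
print(xma_lu(3, 5, 7))            # 22
```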
INTEGER DIVIDE AND INTEGER REMAINDER COMPUTATIONS

The Itanium architecture does not provide any integer divide or remainder operations; these too are implemented in software. Intel provides some recommended algorithms that have been proved correct mathematically and designed to work efficiently. Integer division works by transferring the operands to floating-point registers, performing an approximate floating-point division, and truncating the answer before returning it to an integer register. The remainder operation is based on integer division; one more multiply-subtract is performed at the end.

The integer divide and remainder computations are not affected by the rounding mode set by the user in the FPSR main status field (sf0). All floating-point operations use the reserved status field sf1; thus they are performed in register file size format (17-bit exponent, 64-bit mantissa) and in round-to-nearest mode.

More efficient algorithms for short integers are also provided by Intel. The fastest algorithms are for 8-bit operands, and the slowest but most general are for 64-bit operands. Variants of intermediate speed are provided for 16-bit and 32-bit operands. For 8- and 16-bit operands, the fastest algorithms use iterative subtraction instead of floating-point operations.

The Intel algorithms and assembly code are provided, at no cost, on-line as “Divide, Square Root, and Remainder Algorithms for the IA-64 Architecture” at http://developer.intel.com/software/opensource/numerics/.
FLOATING-POINT LOGIC INSTRUCTIONS

The floating-point logic instructions perform functions similar to those of the integer logic instructions that were introduced in Chapter 6; however, they process data organized in floating-point register format. Table 9.21 shows that the basic logic functions supported for integer data (AND, OR, exclusive-OR, and AND complement) are also implemented with floating-point instructions. Moreover, one new function called floating-point select is available.

Table 9.21: Floating-point Logic Instructions

Mnemonic: fand
Operation: Floating-point logical AND
Format: (qp) fand f1=f2,f3
Description: The bit-wise logical AND of the significand fields of f2 and f3 is computed. The resulting value is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: for
Operation: Floating-point logical OR
Format: (qp) for f1=f2,f3
Description: The bit-wise logical OR of the significand fields of f2 and f3 is computed. The resulting value is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: fxor
Operation: Floating-point exclusive OR
Format: (qp) fxor f1=f2,f3
Description: The bit-wise logical exclusive-OR of the significand fields of f2 and f3 is computed. The resulting value is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: fandcm
Operation: Floating-point AND complement
Format: (qp) fandcm f1=f2,f3
Description: The bit-wise logical AND of the significand field of f2 with the bit-wise complemented significand field of f3 is computed. The resulting value is stored in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1

Mnemonic: fselect
Operation: Floating-point select
Format: (qp) fselect f1=f3,f4,f2
Description: The significand field of f3 is logically ANDed with the significand field of f2, and the significand field of f4 is logically ANDed with the 1’s complement of the significand field of f2. The two results are logically ORed together. The result is placed in the significand field of f1.
  1003EH → exponent field of f1
  0 (positive) → sign field of f1
Table 9.21 provides the format and a description of the operation of each logic instruction. An example is the instruction:

(qp) fand f5=f10,f11

Executing the instruction performs the logic operation:

Fr10 • Fr11 → Fr5

That is, an individual logical AND operation is performed on each pair of corresponding bits in the significands of source operands Fr10 and Fr11, and the result is placed in the corresponding bit location in the significand of destination operand Fr5. As part of this logical computation, the biased exponent field is made 1003EH, and the sign bit is cleared to 0 to identify a floating-point integer result. If either of the source operand registers contains NaTVal, the destination is set to NaTVal instead of the computed result.

Example

If the values in Fr10, Fr11, and Fr12 are 1003E000000000000007DH, 1003E00000000000000FEH, and 1003EFFFFFFFFFFFFFFF0H, respectively, what result is produced in Fr6 after executing the instruction sequence that follows?

(p0) for f5=f10,f11
(p0) fandcm f6=f5,f12

Solution

Expressing the significands of the two source operands in binary form, we get:

Fr10 = 000000000000007DH = 0000000000000000000000000000000000000000000000000000000001111101
Fr11 = 00000000000000FEH = 0000000000000000000000000000000000000000000000000000000011111110

Performing the OR logic operation gives:

Fr5 = Fr10 + Fr11 = 0000000000000000000000000000000000000000000000000000000011111111

Now, the significand value in Fr12 (FFFFFFFFFFFFFFF0H) is complemented, then ANDed with the significand of Fr5 to give:

complement of Fr12 = 000000000000000FH = 0000000000000000000000000000000000000000000000000000000000001111
Fr6 = Fr5 • (complement of Fr12) = 0000000000000000000000000000000000000000000000000000000000001111

Expressing the result in floating-point notation gives:

Fr6 = 1003E000000000000000FH

The floating-point select (fselect) instruction performs a special logic operation with the significands of the three source operands. This logical function is expressed as follows:

(f3 • f2) + (f4 • NOT f2) → f1
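The logic operations on the significand fields map directly onto integer bit operations, as the following Python sketch illustrates. The function names are hypothetical (f_or is used because for is a Python keyword), only the 64-bit significands are modeled, and NaTVal handling is omitted.

```python
MASK64 = (1 << 64) - 1
INT_EXPONENT = 0x1003E  # exponent field written along with each result

def fand(f2, f3):
    return f2 & f3

def f_or(f2, f3):
    return f2 | f3

def fxor(f2, f3):
    return f2 ^ f3

def fandcm(f2, f3):
    """AND of f2 with the bit-wise complement of f3."""
    return f2 & ~f3 & MASK64

def fselect(f3, f4, f2):
    """Bits of f3 where the mask f2 is 1, bits of f4 where it is 0."""
    return (f3 & f2) | (f4 & ~f2 & MASK64)

def register_notation(significand):
    """Format a result as exponent field plus significand, sign 0."""
    return f"{INT_EXPONENT:05X}{significand:016X}H"

# Reproduce the worked example: for f5=f10,f11 then fandcm f6=f5,f12
fr5 = f_or(0x7D, 0xFE)
fr6 = fandcm(fr5, 0xFFFFFFFFFFFFFFF0)
print(register_notation(fr6))  # prints 1003E000000000000000FH
```

Viewed this way, fselect acts as a bit-level multiplexer with f2 as the selection mask.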
FLOATING-POINT DATA ARRANGEMENT INSTRUCTIONS

In Chapter 8, we examined the operation of the parallel data arrangement instructions. The instruction set provides other instructions that rearrange data, but they process floating-point data in the floating-point registers. Table 9.22 lists the six floating-point data arrangement instructions and their instruction formats. Note that each instruction performs its respective data arrangement operation (merge, mix, pack, swap, or sign extend) on the significands of the floating-point numbers in the source registers specified by f2 and f3 and places the result in the significand of the destination register identified by f1. If either source operand of the data arrangement instruction equals NaTVal, the result produced in the destination operand is NaTVal, not the computed result.

Table 9.22: Floating-point Data Arrangement Instruction Formats

Mnemonic   Operation                        Format
fmerge     Floating-point merge             (qp) fmerge.ns f1=f2,f3
                                            (qp) fmerge.s f1=f2,f3
                                            (qp) fmerge.se f1=f2,f3
fpmerge    Floating-point parallel merge    (qp) fpmerge.ns f1=f2,f3
                                            (qp) fpmerge.s f1=f2,f3
                                            (qp) fpmerge.se f1=f2,f3
fmix       Floating-point parallel mix      (qp) fmix.l f1=f2,f3
                                            (qp) fmix.r f1=f2,f3
                                            (qp) fmix.lr f1=f2,f3
fpack      Floating-point pack              (qp) fpack f1=f2,f3
fswap      Floating-point swap              (qp) fswap f1=f2,f3
                                            (qp) fswap.nl f1=f2,f3
                                            (qp) fswap.nr f1=f2,f3
fsxt       Floating-point sign extend       (qp) fsxt.l f1=f2,f3
                                            (qp) fsxt.r f1=f2,f3
Floating-point Merge Operation

Let us begin by examining the operations performed by the floating-point merge (fmerge) and floating-point parallel merge (fpmerge) instructions. The formats in Table 9.22 identify three different merge operations, which are selected by appending the instruction mnemonic with the completers merge negative sign (.ns), merge sign (.s), and merge sign and exponent (.se). The following instruction is an example:

(p0) fmerge.ns f5=f6,f7

When this instruction is executed, the value of the negated sign bit in Fr6 is merged with the exponent and significand from the value in Fr7 to form a new value in Fr5. This operation is demonstrated in Figure 9.12(a).

Figure 9.12: Floating-point Merge Operations

An example of the merge sign version of the instruction is:

(p0) fmerge.s f5=f6,f7

Figure 9.12(b) shows that it performs the same function, except that the sign is not complemented. The merge sign and exponent form of the instruction performs a slightly different operation. Notice in Figure 9.12(c) that both the sign and biased exponent from the source register specified by f2 are merged together with the significand from the source register identified by f3 in the destination register corresponding to f1.

The operations performed by the fpmerge instruction are identical to those just described for the fmerge instruction. However, it performs the operation specified by the completer independently on each element of the pair of single-precision numbers in the source and destination operand registers. Figures 9.13(a), (b), and (c) show the parallel negative sign merge, parallel sign merge, and parallel sign and exponent merge operations, respectively. When a parallel merge operation takes place, the value of the biased exponent in f1 is made 1003EH, and the sign is made 0 to reflect positive integer results.
Figure 9.13: Parallel Floating-point Merge Operations
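The three scalar merge forms described above can be sketched in a few lines of Python. This is only an illustrative model, not the architecture's definition: the FpReg tuple and the fmerge function are hypothetical names, and the register is reduced to the three fields the text mentions (sign, biased exponent, significand).

```python
from typing import NamedTuple


class FpReg(NamedTuple):
    """Simplified floating-point register: sign, 17-bit biased exponent, 64-bit significand."""
    sign: int
    exponent: int
    significand: int


def fmerge(completer: str, f2: FpReg, f3: FpReg) -> FpReg:
    """Model of fmerge.ns / fmerge.s / fmerge.se as described in the text."""
    if completer == "ns":   # negated sign of f2, exponent and significand of f3
        return FpReg(f2.sign ^ 1, f3.exponent, f3.significand)
    if completer == "s":    # sign of f2, exponent and significand of f3
        return FpReg(f2.sign, f3.exponent, f3.significand)
    if completer == "se":   # sign and exponent of f2, significand of f3
        return FpReg(f2.sign, f2.exponent, f3.significand)
    raise ValueError(f"unknown completer: {completer}")


# Hypothetical register contents chosen only for illustration.
fr6 = FpReg(0, 0x10006, 0xFAC0000000000000)
fr7 = FpReg(1, 0x10005, 0xFAB0000000000000)
fr5_ns = fmerge("ns", fr6, fr7)
fr5_se = fmerge("se", fr6, fr7)
```

With these inputs, the .ns form takes everything but the sign from Fr7, while the .se form takes only the significand from Fr7.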
Floating-point Parallel Mix Operation
The floating-point parallel mix (fmix) instruction can be used to form a new value of parallel data from a specified combination of the left (upper) or right (lower) parallel single-precision numbers of the source operands. The allowed formats of the instruction are given in Table 9.22. The left (.l) completer forms the destination operand in f1 from the 32-bit left elements, bit 32 through bit 63, of the source operands f2 and f3. The diagram in Figure 9.14(a) illustrates this operation. The resulting value in f1 is coded as parallel floating-point numbers.
Figure 9.14: Floating-point Parallel Mix Operations
Example
If the values of parallel data in Fr6 and Fr7 are 1003E0000007C0000007DH and 1003E0000007B0000007EH, respectively, what result is produced in Fr5 when the following instruction is executed?
(p0) fmix.r f5=f6,f7
Solution
This instruction employs the second format of the fmix instruction from Table 9.22. The operation it performs is shown in Figure 9.14(b). The right values of the source operands are:
Fr6r = 0000007DH
Fr7r = 0000007EH
Mixing these values in the destination gives:
Fr5 = 1003E0000007D0000007EH
The third form of the mix instruction, fmix.lr, forms the new value in destination f1 from the left value in source f2 and the right value of source f3. This operation is summarized in Figure 9.14(c).
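As a rough check of the three mix forms, the following Python sketch operates on the 64-bit significand only (the sign and exponent fields of the result are fixed at 0 and 1003EH, as stated in the text). The function names are hypothetical, and the .l behavior is inferred from the worked .r example rather than taken from Figure 9.14(a).

```python
MASK32 = 0xFFFFFFFF


def left(sig: int) -> int:
    """Left (upper) single-precision element, bits 63:32."""
    return (sig >> 32) & MASK32


def right(sig: int) -> int:
    """Right (lower) single-precision element, bits 31:0."""
    return sig & MASK32


def fmix(completer: str, f2_sig: int, f3_sig: int) -> int:
    """Return the 64-bit significand produced by fmix.l, fmix.r, or fmix.lr."""
    if completer == "l":
        hi, lo = left(f2_sig), left(f3_sig)
    elif completer == "r":
        hi, lo = right(f2_sig), right(f3_sig)
    elif completer == "lr":
        hi, lo = left(f2_sig), right(f3_sig)
    else:
        raise ValueError(f"unknown completer: {completer}")
    return (hi << 32) | lo


# Significands from the worked example (Fr6 and Fr7).
fr6_sig = 0x0000007C0000007D
fr7_sig = 0x0000007B0000007E
fr5_text = f"1003E{fmix('r', fr6_sig, fr7_sig):016X}H"
```

Here fr5_text reproduces the register image from the example solution, 1003E0000007D0000007EH.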
Floating-point Pack Operation
Table 9.22 shows the format of the floating-point pack (fpack) instruction. When it is executed, the floating-point numbers in source operands f2 and f3 are converted to single-precision memory format and placed into the destination operand f1, as shown in Figure 9.15. The translation from 82-bit register format to single-precision memory format is illustrated in Figure 9.6(a). Remember that the 32-bit single-precision memory format includes the original sign bit and 23-bit significand value with a compacted 8-bit exponent formed from the biased exponent field and integer bit.
Figure 9.15: Floating-point Pack Operation
Example
The values in registers Fr6 and Fr7 are 10006FAC0000000000000H and 10005FAB0000000000000H. What result is produced in Fr5 by executing the following instruction?
(p0) fpack f5=f6,f7
Solution
Expressing the values in Fr6 and Fr7 in binary form, partitioning into fields (sign, biased exponent, integer bit, fraction), and eliminating lower-order fill 0s gives:
Fr6reg = 10006FAC0000000000000H = 0 10000000000000110 1 11110101100000000000000₂
Fr7reg = 10005FAB0000000000000H = 0 10000000000000101 1 11110101011000000000000₂
After converting to 32-bit memory form, the values are:
Fr6mem = 0 1 0000110 11110101100000000000000₂ = 437AC000H
Fr7mem = 0 1 0000101 11110101011000000000000₂ = 42FAB000H
Now, packing these values into destination register Fr5 and coding the result as a parallel floating-point number, we get:
Fr5 = 1003E437AC00042FAB000H
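The conversion in this example can be reproduced with a short Python sketch. It is a simplified model under stated assumptions: it handles only normalized values whose exponent fits the single-precision range, it rebiases the exponent arithmetically (register bias FFFFH, single-precision bias 127), and it simply keeps the upper 23 fraction bits. The function names are hypothetical.

```python
import struct

REG_BIAS = 0xFFFF      # bias of the 17-bit register-format exponent
SINGLE_BIAS = 127      # bias of the 8-bit single-precision exponent


def reg_to_single_bits(sign: int, exponent: int, significand: int) -> int:
    """Convert a normalized register-format value to a 32-bit single-precision image."""
    if not (significand >> 63) & 1:
        raise ValueError("sketch handles normalized values only (integer bit must be 1)")
    exp8 = exponent - REG_BIAS + SINGLE_BIAS
    if not 0 < exp8 < 255:
        raise ValueError("exponent outside the single-precision normal range")
    fraction = (significand >> 40) & 0x7FFFFF   # upper 23 bits below the integer bit
    return (sign << 31) | (exp8 << 23) | fraction


def fpack(f2: tuple, f3: tuple) -> int:
    """Return the 64-bit significand of f1: f2 in the left element, f3 in the right."""
    return (reg_to_single_bits(*f2) << 32) | reg_to_single_bits(*f3)


def as_float(bits: int) -> float:
    """Decode a 32-bit single-precision image to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


fr6 = (0, 0x10006, 0xFAC0000000000000)
fr7 = (0, 0x10005, 0xFAB0000000000000)
fr5_text = f"1003E{fpack(fr6, fr7):016X}H"
```

Decoding the two packed elements with as_float gives 250.75 and 125.34375, which are the values the register images in the example represent.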
Floating-point Swap Operation
The floating-point swap (fswap) instruction is used to process single-precision floating-point numbers that are packed in parallel in floating-point registers. The three allowed formats of the fswap instruction are given in Table 9.22. The operation performed by the first format is illustrated in Figure 9.16(a).
Figure 9.16: Floating-point Swap Operations
The following instruction is an example:
(p0) fswap f5=f6,f7
Assuming that the data in Fr6 and Fr7 are 1003E437AC00042FD6000H and 1003E42FAC000C2FD6000H, respectively, executing the instruction causes the left single-precision value from Fr6 and the right single-precision value from Fr7 to be combined in Fr5. Partitioning the values in the source operands into left and right numbers gives:
Fr6 = 1003E 437AC000 42FD6000H
Fr6left = 437AC000H
Fr7 = 1003E 42FAC000 C2FD6000H
Fr7right = C2FD6000H
Combining the values of Fr7right and Fr6left gives the following parallel floating-point result:
Fr5 = 1003EC2FD6000437AC000H
Figure 9.16(b) demonstrates the operation of the negate left (.nl) and negate right (.nr) forms of the instruction.
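The plain swap form worked through above amounts to two shifts and a mask. The sketch below models only that form on the 64-bit significand; the .nl and .nr variants are left out because the text does not spell out their bit-level behavior. The name fswap is used here as a hypothetical Python function.

```python
MASK32 = 0xFFFFFFFF


def fswap(f2_sig: int, f3_sig: int) -> int:
    """Right element of f3 goes to the left of f1; left element of f2 goes to the right."""
    f3_right = f3_sig & MASK32
    f2_left = (f2_sig >> 32) & MASK32
    return (f3_right << 32) | f2_left


# Significands from the example (Fr6 and Fr7).
fr6_sig = 0x437AC00042FD6000
fr7_sig = 0x42FAC000C2FD6000
fr5_text = f"1003E{fswap(fr6_sig, fr7_sig):016X}H"
```

The resulting string matches the Fr5 value derived in the example.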
Floating-point Sign Extend Operation
The last floating-point data arrangement instruction is floating-point sign extend (fsxt). Table 9.22 shows both forms of this instruction: sign extend left, which is specified by appending the completer .l to the mnemonic, and sign extend right, which is specified with the .r completer. The operations performed with these instructions are illustrated in Figures 9.17(a) and (b).
Figure 9.17: Floating-point Sign Extend Operations
As an example, let us determine what result is produced when the following instruction is executed:
(p0) fsxt.r f5=f6,f7
Assume that Fr6 contains the value 1003E42FAC000C2FD6000H and that Fr7 contains 1003E437AC00042FD6000H. First, the fsxt.r instruction extracts the sign from the right single-precision floating-point value in Fr6. The right number is:
Fr6right = C2FD6000H = 11000010111111010110000000000000₂
Therefore, the sign bit is 1 for negative. Then, the right number in Fr7 is extracted. This value is:
Fr7right = 42FD6000H
Extending the sign bit and combining it with Fr7right in destination Fr5 gives:
Fr5 = 1003EFFFFFFFF42FD6000H
Notice that the result is expressed in 82-bit register format with the sign set to 0 and the biased exponent equal to 1003EH.
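A small Python model makes the sign-extend example easy to verify. The .r branch follows the worked example directly; the .l branch is an assumption made by symmetry (sign taken from the left element of f2, data taken from the left element of f3), since the text only refers to Figure 9.17(a) for it. The function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fsxt(completer: str, f2_sig: int, f3_sig: int) -> int:
    """Return the 64-bit significand produced by fsxt.r or fsxt.l."""
    if completer == "r":
        sign = (f2_sig >> 31) & 1            # sign bit of f2's right element
        data = f3_sig & MASK32               # right element of f3
    elif completer == "l":                   # assumed by symmetry with .r
        sign = (f2_sig >> 63) & 1            # sign bit of f2's left element
        data = (f3_sig >> 32) & MASK32       # left element of f3
    else:
        raise ValueError(f"unknown completer: {completer}")
    extension = MASK32 if sign else 0        # 32 copies of the sign bit
    return (extension << 32) | data


fr6_sig = 0x42FAC000C2FD6000
fr7_sig = 0x437AC00042FD6000
fr5_text = f"1003E{fsxt('r', fr6_sig, fr7_sig):016X}H"
```

For the example operands, fr5_text is the same register image as the solution above.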
FLOATING-POINT COMPARE AND CLASS INSTRUCTIONS
In our study of the integer instructions of the instruction set, we found that the compare and test instructions perform conditional tests of information and, based on the result of the test, set or reset predicate registers. Predicate registers are also written by the following instructions: floating-point register compare (fcmp), floating-point class (fclass), floating-point reciprocal approximation (frcpa), and floating-point reciprocal square root approximation (frsqrta). We covered the frcpa and frsqrta instructions in the earlier section on floating-point arithmetic instructions. Here the operations performed by the fcmp, fpcmp, and fclass instructions are examined.
Floating-point Compare Instructions
Two instructions perform floating-point comparison operations: floating-point register compare (fcmp) and floating-point parallel register compare (fpcmp). The fcmp instruction reads the values in the source floating-point registers, performs a comparison test on these values, and, based on the outcome of this comparison, writes two predicate registers. The general format of the floating-point compare instruction is:
(qp) fcmp.frel.fctype.sf p1,p2=f2,f3
Just like the integer compare instruction, a variety of different comparison operations are defined for fcmp. They are selected by the completers floating-point relationship (.frel), floating-point comparison type (.fctype), and status field (.sf). For example, the instruction could be written as follows:
(p0) fcmp.eq p5,p6=f10,f11
Let us briefly look at the purpose of each of these completers. The .sf completer was used by earlier instructions and selects a status field from the FPSR. The choices for this completer are listed in Table 9.9. In our example, status field s0 is selected by not including a completer. The .frel completer selects the test relationship that is applied to the source operands. Using .eq in the example tests whether the floating-point numbers in Fr10 and Fr11 are equal. The allowed floating-point comparison relationships are listed in Table 9.24. The .fctype completer determines how the predicates are updated by the conditional test. Since no .fctype completer is included in the example, the results produced in predicate registers Pr5 and Pr6 correspond to the "none" row in Table 9.25. For instance, if the two floating-point numbers are equal, the results produced are Pr5 = 1 and Pr6 = 0. The AND, OR, and DeMorgan types that are provided for integer comparisons are not supported for the floating-point compare instruction.

Table 9.23: Floating-point Compare Instructions
| Mnemonic | Operation | Format | Description |
|---|---|---|---|
| fcmp | Floating-point compare | (qp) fcmp.frel.fctype.sf p1,p2=f2,f3 | The two source operands f2 and f3 are compared for one of twelve relations specified by frel. This produces a result which is 1 if the comparison condition is True, and 0 if False. This result is written to the two destination predicate registers p1 and p2. The compare type specified with fctype determines the way the result is written to the destinations. |
| fpcmp | Floating-point parallel compare | (qp) fpcmp.frel.sf f1=f2,f3 | The two pairs of single-precision source operands in the significand fields of f2 and f3 are compared for one of twelve relations specified by frel. This produces a result that is 32 1s if the comparison condition is True, and 32 0s if False. This result is written to a pair of 32-bit integers in the significand field of f1. 1003EH → exponent field of f1; 0 (positive) → sign field of f1. |
| fclass | Floating-point class | (qp) fclass.fcrel.fctype p1,p2=f2,fclass9 | The contents of f2 are classified according to the category specified by the fclass9 immediate operand. Based upon whether the contents of source register f2 agree with the type of floating-point number specified by fclass9, as specified by the fcrel completer, a result is written to the two destination predicate registers p1 and p2. The results written to the destinations are determined by the compare type specified with fctype. |

Table 9.24: Floating-point Compare Relationships
| frel | Meaning | Compare relationship |
|---|---|---|
| eq | Equal | f2==f3 |
| lt | Less than | f2<f3 |
| le | Less than or equal to | f2<=f3 |
| gt | Greater than | f2>f3 |
| ge | Greater than or equal to | f2>=f3 |
| unord | Unordered | f2 ? f3 |
| neq | Not equal | !(f2==f3) |
| nlt | Not less than | !(f2<f3) |
| nle | Not less than or equal to | !(f2<=f3) |
| ngt | Not greater than | !(f2>f3) |
| nge | Not greater than or equal to | !(f2>=f3) |
| ord | Ordered | !(f2 ? f3) |

Table 9.25: Floating-point Comparison Types
A blank entry means that the predicate register is not written.
| fctype | Meaning | Pr[qp]==0: Pr1 | Pr[qp]==0: Pr2 | Pr[qp]==1, Result==0, no source NaTVals: Pr1 | Pr[qp]==1, Result==0, no source NaTVals: Pr2 | Pr[qp]==1, Result==1, no source NaTVals: Pr1 | Pr[qp]==1, Result==1, no source NaTVals: Pr2 | Pr[qp]==1, one or more source NaTVals: Pr1 | Pr[qp]==1, one or more source NaTVals: Pr2 |
|---|---|---|---|---|---|---|---|---|---|
| none | Normal | | | 0 | 1 | 1 | 0 | 0 | 0 |
| unc | Unconditional | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
Similar to the integer comparison relationships, not all of the floating-point comparison relationships are directly implemented in hardware. Some are actually pseudo-ops.
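The way Tables 9.24 and 9.25 combine can be illustrated with a small Python model. It is a sketch under assumptions, not a description of the hardware: Python floats stand in for register values, None stands in for NaTVal, the status field completer is ignored, and the function and variable names are hypothetical.

```python
import math

NATVAL = None  # stand-in for the NaTVal register encoding (assumption of this sketch)


def unordered(a: float, b: float) -> bool:
    """Two values are unordered when at least one of them is a NaN."""
    return math.isnan(a) or math.isnan(b)


# Relations from Table 9.24.
FREL = {
    "eq": lambda a, b: a == b,
    "lt": lambda a, b: a < b,
    "le": lambda a, b: a <= b,
    "gt": lambda a, b: a > b,
    "ge": lambda a, b: a >= b,
    "unord": unordered,
}
for _name in ("eq", "lt", "le", "gt", "ge"):
    # neq, nlt, nle, ngt, nge are the logical negations of the base relations.
    FREL["n" + _name] = lambda a, b, _rel=FREL[_name]: not _rel(a, b)
FREL["ord"] = lambda a, b: not unordered(a, b)


def fcmp(frel, fctype, qp, f2, f3, p1, p2):
    """Return the new (p1, p2) pair according to Table 9.25 ('none' or 'unc')."""
    if not qp:
        # Normal type leaves the predicates alone; unconditional type clears both.
        return (0, 0) if fctype == "unc" else (p1, p2)
    if f2 is NATVAL or f3 is NATVAL:
        return (0, 0)
    return (1, 0) if FREL[frel](f2, f3) else (0, 1)


# (p0) fcmp.eq p5,p6=f10,f11 with equal operands
pr5, pr6 = fcmp("eq", "none", 1, 2.5, 2.5, 0, 0)
```

Note that with a NaN operand the negated relations such as nlt evaluate to True in this model, because the underlying relation lt is False for unordered operands.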
The fpcmp instruction compares each of the two independent values in source operand f2 to the corresponding element of data in source operand f3. The test performed as part of these independent compares is specified by one of the comparison relationships in Table 9.24. However, predicate registers are not updated based on the result; instead, special values identifying the result of each comparison are placed in the destination register specified by f1. If a comparison is True, 1s are written to the corresponding 32-bit field of the significand in the destination register. If the comparison is False, 0s are written into the 32-bit field. The sign field is made 0 and the biased exponent field is 1003EH. For instance, suppose the following instruction is executed:
(p0) fpcmp.lt f5=f10,f11
If the upper floating-point value in Fr10 is found to be less than its corresponding value in Fr11 and the lower value in Fr10 is greater than the corresponding value in Fr11, the parallel result produced in the destination is:
Fr5 = 1003EFFFFFFFF00000000H
If either f2 or f3 contains NaTVal, f1 is set to NaTVal.
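The parallel compare can be modeled by decoding each 32-bit element, applying the relation, and writing a 32-bit mask. The sketch below is illustrative only: it works on the 64-bit significand, ignores NaTVal handling and the status field, and uses hypothetical names; the operand values are made up to match the situation described in the text.

```python
import operator
import struct

MASK32 = 0xFFFFFFFF


def single(bits: int) -> float:
    """Decode a 32-bit single-precision image to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fpcmp(relation, f2_sig: int, f3_sig: int) -> int:
    """Compare the left and right elements independently; return the 64-bit mask result."""
    result = 0
    for shift in (32, 0):                      # left element first, then right element
        a = single((f2_sig >> shift) & MASK32)
        b = single((f3_sig >> shift) & MASK32)
        if relation(a, b):
            result |= MASK32 << shift          # 32 ones for True, 32 zeros for False
    return result


# Hypothetical operands: Fr10 = (1.0, 5.0), Fr11 = (2.0, 3.0)
fr10_sig = 0x3F80000040A00000
fr11_sig = 0x4000000040400000
fr5_text = f"1003E{fpcmp(operator.lt, fr10_sig, fr11_sig):016X}H"
```

Because 1.0 < 2.0 but 5.0 is not less than 3.0, the left field is all 1s and the right field is all 0s, giving the same pattern as the Fr5 value shown in the text.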
Floating-point Class Instruction
The floating-point class (fclass) instruction can be used to classify a number that resides in a floating-point register. For example, it could determine whether or not the value in a floating-point register is a positive, normalized floating-point number. Table 9.23 shows that its format is given in general as:
(qp) fclass.fcrel.fctype p1,p2=f2,fclass9
The classification operation that is performed on the value in source operand f2 is further defined by adding floating-point class relationship (.fcrel) and floating-point comparison type (.fctype) completers to the mnemonic. The choices for completers .fcrel and .fctype are listed in Tables 9.26 and 9.25, respectively.

Table 9.26: Floating-point Class Relationships
| fcrel | Meaning | Test relation |
|---|---|---|
| m | Member | The type of the value in f2 agrees with the pattern specified by fclass9 |
| nm | Not a member | The type of the value in f2 does not agree with the pattern specified by fclass9 |
Immediate operand fclass9 is used to specify the classification tested, as in the following sample instruction:
(p0) fclass.m p9,p10=f5,@nat
Here the @nat immediate operand says that the value in source operand Fr5 is tested to determine whether or not it is NaTVal. The .m completer, which stands for "member," means that a True response results if there is a match between the type of the value in the source operand and the type specified with the classification operand. Finally, since no floating-point comparison type completer is appended, the results produced by the classification operation in predicate registers Pr9 and Pr10 correspond to the normal case in Table 9.25.

Table 9.27: Floating-point Classes (fclass9) Options
| fclass9 | Class | Mnemonic |
|---|---|---|
| Either these cases can be tested for: | | |
| 0x100 | NaTVal | @nat |
| 0x080 | Quiet NaN | @qnan |
| 0x040 | Signaling NaN | @snan |
| Or, the OR of the following two cases: | | |
| 0x001 | Positive | @pos |
| 0x002 | Negative | @neg |
| ANDed with the OR of the following four cases: | | |
| 0x004 | Zero | @zero |
| 0x008 | Unnormalized | @unorm |
| 0x010 | Normalized | @norm |
| 0x020 | Infinity | @inf |
As another example, let us rewrite our example instruction to test whether the value in Fr5 is a member of the class of all normalized numbers. The class of all normalized numbers includes all positive and negative normalized numbers; therefore, the fclass9 operand implements the expression (@pos OR @neg) AND @norm, and the instruction is written as
(p0) fclass.m p9,p10=f5,@pos|@neg|@norm
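The structure of the fclass9 operand, as laid out in Table 9.27, can be sketched in Python. The bit values come from the table; everything else is an assumption of this sketch: Python floats stand in for register values, None stands in for NaTVal, every NaN is treated as quiet, subnormal doubles are treated as unnormalized, and the function names are hypothetical.

```python
import math
import sys

FCLASS = {
    "@nat": 0x100, "@qnan": 0x080, "@snan": 0x040,
    "@pos": 0x001, "@neg": 0x002,
    "@zero": 0x004, "@unorm": 0x008, "@norm": 0x010, "@inf": 0x020,
}
SPECIAL_BITS = 0x1C0   # @nat, @qnan, @snan
SIGN_BITS = 0x003      # @pos, @neg
KIND_BITS = 0x03C      # @zero, @unorm, @norm, @inf


def fclass9(*names: str) -> int:
    """OR the named classes together into a 9-bit immediate."""
    mask = 0
    for name in names:
        mask |= FCLASS[name]
    return mask


def classify(x) -> int:
    """Return the class bits that describe x (simplified; see assumptions above)."""
    if x is None:
        return FCLASS["@nat"]
    if math.isnan(x):
        return FCLASS["@qnan"]
    sign = "@neg" if math.copysign(1.0, x) < 0 else "@pos"
    if x == 0:
        kind = "@zero"
    elif math.isinf(x):
        kind = "@inf"
    elif abs(x) < sys.float_info.min:
        kind = "@unorm"
    else:
        kind = "@norm"
    return FCLASS[sign] | FCLASS[kind]


def is_member(x, mask: int) -> bool:
    """Membership test: special case matches, or (sign matches AND kind matches)."""
    bits = classify(x)
    if bits & SPECIAL_BITS:
        return bool(bits & mask)
    return bool(bits & mask & SIGN_BITS) and bool(bits & mask & KIND_BITS)


all_normalized = fclass9("@pos", "@neg", "@norm")
```

With this encoding, all_normalized has the value 0x013, and the .nm relation would simply be the negation of is_member.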
Chapter 10: IA-32 Compatibility and Application Execution
Up to this point, we have focused on the 64-bit elements of the Itanium architecture and their intended applications. The Itanium architecture is designed with the additional goal of preserving the existing investment in IA-32 software applications and infrastructure. For this reason, the Itanium architecture provides binary compatibility with the IA-32 instruction set in hardware. In this chapter, we cover the Itanium-based operating environments, the process that one follows to transition between Itanium-based and IA-32 applications, and the mapping of the Itanium processor’s application register set to the IA-32 application register set.
OPERATING ENVIRONMENTS FOR ITANIUM™-BASED APPLICATIONS
Since the Itanium architecture includes binary compatibility with the IA-32 instruction set, the operating system of an Itanium-based computer can run a mixture of IA-32 and Itanium-based software applications. Figure 10.1(a) shows this software environment. The Itanium-based operating system environment is the intended environment for applications on Itanium-based computer systems. To achieve best performance, both the operating system and the applications must be coded with the Itanium instruction set and tuned to take advantage of the enhancements in the Itanium architecture.
Figure 10.1: Compatibility with IA-32 Applications
Broad operating system and application support is in place for servers and workstations based on the Itanium architecture. Many of the industry’s leading, high-end operating systems, such as Monterey UNIX†, Win64† from Microsoft, and Linux†, are planned to be available when Itanium-based computer systems ramp into production. Moreover, leading applications for these market segments are being rewritten to run directly on the Itanium architecture.
Adding the ability to run IA-32 applications under the Itanium-based operating system permits access to the large, existing IA-32 software infrastructure. Thus, Itanium-based systems can run IA-32 and Itanium-based software applications transparently. In this way, a broader base of applications is immediately available on the Itanium architecture.
The Itanium processor is also capable of operating in a dedicated IA-32 system environment. However, software working in this environment cannot benefit from the architectural advancements available in the processor. Figure 10.1(b) illustrates a dedicated IA-32 software environment. This environment provides compatibility with the Intel Pentium III processor family and executes the IA-32 instruction set. For this reason, operating systems and application programs written for the IA-32 protected mode, real mode, or virtual 8086 mode can be directly run by the Itanium processor. For instance, the Windows† 3.1, Windows 95/98, Windows NT†, and UNIX operating systems and their applications can run directly. However, they run at a performance level comparable to that achieved with the current IA-32 family processors, not at the level of performance of an application that is coded for the Itanium architecture. Table 10.1 summarizes the operating environments supported by the Itanium processors.

Table 10.1: Itanium Processor Operating Environments
| System environment | Application environment | Usage |
|---|---|---|
| Itanium processor | Itanium instruction set | Itanium-based applications on Itanium-based operating systems. |
| | IA-32 protected mode | IA-32 protected mode applications in the Itanium-based system environment, if supported by the OS. |
| | IA-32 real mode | IA-32 real mode applications in the Itanium-based system environment, if supported by the OS. |
| | IA-32 virtual 8086 mode | IA-32 virtual 8086 mode applications in the Itanium-based system environment, if supported by the OS. |
| IA-32 | IA-32 instruction set | IA-32 protected mode, real mode, and virtual 8086-mode application and operating system environments. Compatible with IA-32 Pentium, Pentium Pro, Pentium II, and Pentium III processors. |
INVOKING IA-32 APPLICATIONS
Now, we know that the operating system of an Itanium-based computer can run Itanium-based and IA-32 software applications concurrently and that these applications are coded using instructions from the Itanium instruction set and IA-32 instruction set, respectively. Actually, the operating system runs these applications in a time-shared manner. Therefore, it executes either IA-32 or Itanium instructions at any one time. So, how does the Itanium processor switch between execution of Itanium instructions and IA-32 instructions? The architectural mechanism that performs this switching operation is known as the instruction set transition model. This mechanism can be disabled by the operating system to implement an Itanium-based only execution environment.
Instruction Set Transition Model
Figure 10.2 illustrates the instruction set transition model. When operating in the Itanium environment, the operating system initiates transition to an IA-32 application by performing an invoke IA-32 instruction set branch (br.ia) operation. The operating system initiates the return from the IA-32 application to the Itanium-based execution environment with a jump to Itanium (jmpe) instruction. This branch and jump process is the primary mechanism used to run IA-32 applications from an Itanium-based operating system.
Figure 10.2: Instruction Set Transition Model
A second transition process is identified in Figure 10.2. Notice that a transition back to the Itanium-based execution environment is initiated by interruptions, such as hardware interrupts, software interrupts, and exceptions, that occur in the IA-32 application. In this way, we see that exception conditions are serviced by the operating system. At completion of the exception handler, an Itanium-based system instruction called return from interruption (rfi) initiates the return to IA-32.
Let’s look more closely at the branch and jump operations involved in the instruction set transition process. Execution of the br.ia instruction initiates an unconditional branch by the Itanium-based operating system to an IA-32 application. The steps in the transition process include switching to the IA-32 instruction set, branching to a target instruction in the IA-32 environment, and executing interface code to initialize an IA-32 protected mode, real mode, or virtual 8086-mode environment. The instruction set control bit (is) in the Itanium processor’s status register specifies the currently executing instruction set. This bit must be logic 0 to execute an Itanium-based application and 1 to run an IA-32 application. The state of this bit is switched back and forth between 0 and 1 as part of the instruction set transition model.
Invoking the IA-32 Instruction Set
Table 10.2 gives the format of the invoke IA-32 instruction set branch instruction. Since this branch is always taken, it is predicated with Pr0, which is always 1. If p0 is not used as the qualifying predicate, execution of the instruction results in an exception. The static taken (.sptk) branch hint is appended to improve the performance of the transition process. Therefore, an example of the instruction is written as:
(p0) br.ia.sptk b5

Table 10.2: Invoke IA-32 Branch Instruction Format
| Mnemonic | Format | Operation | Description |
|---|---|---|---|
| br | (qp) br.btype.bwh.ph.dh b2 | b2 → IP{31:0}; 0 → IP{63:32}; 1 → PSR.is | Initiates an unconditional branch operation |
(Where qp = p0, btype = ia, bwh = sptk, ph = none, dh = none, and b2 = Br0 through Br7)

The br.ia instruction only performs the first two steps of the transition process. First, it selects the IA-32 instruction set by setting the is bit in the processor status register to 1. Then, an indirect branch is performed using the address pointer contained in the lower 32 bits of branch register Br5. That is, the value in Br5 is loaded into the lower 32 bits of the instruction pointer, the upper 32 bits of the IP are cleared, and then the address of the first IA-32 instruction in memory is calculated as CS:EIP. The instruction sequence that starts at this address must initialize the IA-32 application environment.
Operating system software must make sure that the IA-32 code segment descriptor (CSD) and code segment base (CS) address are loaded and that all dirty registers in the Itanium processor’s register stack are saved to the backing store before branching to IA-32 execution. Any dirty registers left in the current or prior register stack frames may be modified by the IA-32 application. The register stack is flushed to the backing store by executing the flushrs instruction.
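The architectural effect of br.ia on the instruction pointer and the is bit can be summarized with a toy state model. This is a sketch only: the CpuState class and br_ia function are hypothetical names, and the model follows the earlier statement that the is bit is 1 while IA-32 code is running. It does not model the CS:EIP address calculation or the register stack flush.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    """Toy subset of processor state relevant to instruction set transitions."""
    ip: int = 0       # 64-bit instruction pointer
    psr_is: int = 0   # instruction set bit: 0 = Itanium, 1 = IA-32


def br_ia(state: CpuState, branch_register: int) -> None:
    """Model of (p0) br.ia.sptk bN: select IA-32 and branch through the low 32 bits."""
    state.psr_is = 1                            # select the IA-32 instruction set
    state.ip = branch_register & 0xFFFFFFFF     # IP{31:0} = bN{31:0}, IP{63:32} = 0


# Hypothetical Br5 contents; only the lower 32 bits reach the IP.
cpu = CpuState(ip=0x0000_0001_0000_4000)
br_ia(cpu, 0x1234_5678_9ABC_DEF0)
```

After the call, cpu.ip holds 9ABCDEF0H with the upper 32 bits cleared, and cpu.psr_is is 1.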
Returning to the Itanium Instruction Set
Earlier we explained that the return from an IA-32 real mode, protected mode, or virtual 8086-mode application is initiated by the operating system executing a jmpe instruction. The format and operation of this instruction are summarized in Table 10.3. Notice that the value that is to be loaded into the IP can be specified indirectly as the value in a register or storage location in memory, or as an immediate displacement. The following instruction is an example:
jmpe EDX

Table 10.3: Jump to Itanium Instruction Set
| Mnemonic | Format | Operation | Description |
|---|---|---|---|
| jmpe | jmpe r/m16 | [regptr/memptr16+CS] → IP{15:0}; 0 → IP{63:16}; 0 → PSR.is; address of next instruction → Gr1; 0 → EFLAGS.rf | Jump to Itanium instruction set, indirect address specified by regptr/memptr16 |
| | jmpe r/m32 | [regptr/memptr32+CS] → IP{31:0}; 0 → IP{63:32}; 0 → PSR.is; address of next instruction → Gr1; 0 → EFLAGS.rf | Jump to Itanium instruction set, indirect address specified by regptr/memptr32 |
| | jmpe disp16 | disp16+CS → IP{15:0}; 0 → IP{63:16}; 0 → PSR.is; address of next instruction → Gr1; 0 → EFLAGS.rf | Jump to Itanium instruction set, absolute address specified by disp16 |
| | jmpe disp32 | disp32+CS → IP{31:0}; 0 → IP{63:32}; 0 → PSR.is; address of next instruction → Gr1; 0 → EFLAGS.rf | Jump to Itanium instruction set, absolute address specified by disp32 |
Since a 32-bit register is identified as holding the address, this value represents a return from a protected mode application. When this instruction is executed, the address loaded into the lower 32 bits of the IP equals the value in EDX plus the current value of the CS base address, while the upper 32 bits are cleared to 0. Then, the is bit in the PSR is returned to 0 to reselect the Itanium instruction set, which starts execution at the new bundle address in the IP. Finally, the address of the instruction following the jmpe instruction is saved in general register Gr1 and the resume flag (rf) bit in EFLAGS is cleared.
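The sequence of state changes just described for jmpe EDX can be written out as a toy model. The sketch below is an illustration under assumptions: the CpuState class, the jmpe_r32 function, and all numeric values are hypothetical, and only the four effects named in the text are modeled.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    """Toy subset of processor state touched by jmpe."""
    ip: int = 0          # 64-bit instruction pointer
    psr_is: int = 1      # 1 = IA-32 instruction set currently selected
    gr1: int = 0         # receives the address of the instruction after jmpe
    eflags_rf: int = 1   # IA-32 resume flag


def jmpe_r32(state: CpuState, reg_value: int, cs_base: int, next_address: int) -> None:
    """Model of 'jmpe r32' as described in the text."""
    state.ip = (reg_value + cs_base) & 0xFFFFFFFF   # IP{31:0} = reg + CS base, IP{63:32} = 0
    state.psr_is = 0                                # reselect the Itanium instruction set
    state.gr1 = next_address                        # save the IA-32 return point
    state.eflags_rf = 0                             # clear the resume flag


# Hypothetical values: EDX = 00402000H, CS base = 00010000H
cpu = CpuState()
jmpe_r32(cpu, reg_value=0x00402000, cs_base=0x00010000, next_address=0x00001236)
```

With these made-up inputs, execution would resume in the Itanium instruction set at 00412000H.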
IA-32 APPLICATION REGISTER SET
Earlier, we pointed out that the IA-32 execution environment initiated by an Itanium-based operating system provides a Pentium III processor application architecture and instruction set. The IA-32 application register set includes the general purpose registers, segment selectors, segment descriptors, floating-point registers, MMX technology registers, SIMD extension registers, and special registers such as EFLAGS, status registers, and control registers. The IA-32 register model is entirely contained within the larger Itanium-based application register set. This subset of the Itanium processor’s register set is accessible by both IA-32 and Itanium instructions. This section looks at how the IA-32 registers are mapped to the Itanium processor’s register set.
Table 10.4 is a summary of the mapping of IA-32 registers to the Itanium processor’s register set. The leftmost column in this table is a list of every register in the Itanium-based application register set. They are grouped by Itanium processor register type. Notice that the first group in the list consists of the 128 general-purpose registers, and the last group is the application registers. The second column identifies an IA-32 register if one is mapped to the Itanium processor’s register. For example, Itanium architecture general registers Gr8 through Gr11 act as general-purpose registers EAX, ECX, EDX, and EBX, respectively. The fourth and fifth columns identify the IA-32 size and function of the register.

Table 10.4: IA-32 Register Mapping
| Itanium register | IA-32 register | Convention | Size | Description |
|---|---|---|---|---|
| General purpose integer registers | | | | |
| Gr0 | | | | Constant 0 |
| Gr1–3 | | Modified | | Scratch for IA-32 execution |
| Gr4–7 | | Unmodified | | Itanium architecture preserved registers |
| Gr8 | EAX | IA-32 state | 32 | IA-32 general purpose registers |
| Gr9 | ECX | IA-32 state | 32 | IA-32 general purpose registers |
| Gr10 | EDX | IA-32 state | 32 | IA-32 general purpose registers |
| Gr11 | EBX | IA-32 state | 32 | IA-32 general purpose registers |
| Gr12 | ESP | IA-32 state | 32 | IA-32 general purpose registers |
| Gr13 | EBP | IA-32 state | 32 | IA-32 general purpose registers |
| Gr14 | ESI | IA-32 state | 32 | IA-32 general purpose registers |
| Gr15 | EDI | IA-32 state | 32 | IA-32 general purpose registers |
| Gr16{15:0} | DS | IA-32 state | 64 | IA-32 selectors |
| Gr16{31:16} | ES | IA-32 state | 64 | IA-32 selectors |
| Gr16{47:32} | FS | IA-32 state | 64 | IA-32 selectors |
| Gr16{63:48} | GS | IA-32 state | 64 | IA-32 selectors |
| Gr17{15:0} | CS | IA-32 state | 64 | IA-32 selectors |
| Gr17{31:16} | SS | IA-32 state | 64 | IA-32 selectors |
| Gr17{47:32} | LDT | IA-32 state | 64 | IA-32 selectors |
| Gr17{63:48} | TSS | IA-32 state | 64 | IA-32 selectors |
| Gr18–23 | | Modified | | Scratch for IA-32 execution |
| Gr24 | ESD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr25–26 | | Modified | | Scratch for IA-32 execution |
| Gr27 | DSD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr28 | FSD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr29 | GSD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr30 | LDTD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr31 | GDTD | IA-32 state | 64 | IA-32 segment descriptors (register format) |
| Gr32–127 | | Modified | | IA-32 code execution space |
| Process environment | | | | |
| IP | IP | IA-32 state | 64 | Shared IA-32 and Itanium architecture virtual instruction pointer |
| Floating-point registers | | | | |
| Fr0 | | | | Constant +0.0 |
| Fr1 | | | | Constant +1.0 |
| Fr2–5 | | Unmodified | | Itanium processor preserved registers |
| Fr6–7 | | Modified | | IA-32 code execution space |
| Fr8 | MM0/FP0 | IA-32 state | 64/80 | IA-32 MMX technology registers (aliased on 64-bit FP mantissa); IA-32 FP registers (physical registers mapping) |
| Fr9 | MM1/FP1 | IA-32 state | 64/80 | Same as Fr8 |
| Fr10 | MM2/FP2 | IA-32 state | 64/80 | Same as Fr8 |
| Fr11 | MM3/FP3 | IA-32 state | 64/80 | Same as Fr8 |
| Fr12 | MM4/FP4 | IA-32 state | 64/80 | Same as Fr8 |
| Fr13 | MM5/FP5 | IA-32 state | 64/80 | Same as Fr8 |
| Fr14 | MM6/FP6 | IA-32 state | 64/80 | Same as Fr8 |
| Fr15 | MM7/FP7 | IA-32 state | 64/80 | Same as Fr8 |
| Fr16–17 | XMM0 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register. Low-order 64 bits of XMM0 are mapped to Fr16{63:0}. High-order 64 bits of XMM0 are mapped to Fr17{63:0}. |
| Fr18–19 | XMM1 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr20–21 | XMM2 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr22–23 | XMM3 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr24–25 | XMM4 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr26–27 | XMM5 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr28–29 | XMM6 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr30–31 | XMM7 | IA-32 state | 64 | IA-32 Streaming SIMD Extension register (mapped in the same way as XMM0) |
| Fr32–127 | | Modified | | IA-32 code execution space |
| Predicate registers | | | | |
| Pr0 | | | | Constant 1 |
| Pr1–63 | | Modified | | IA-32 code execution space |
| Branch registers | | | | |
| Br0–5 | | Unmodified | | Itanium preserved registers |
| Br6–7 | | Modified | | IA-32 code execution space |
| Application registers | | | | |
| RSC | | Unmodified | | Not used for IA-32 execution; Itanium architecture preserved registers |
| BSP | | Unmodified | | Not used for IA-32 execution; Itanium architecture preserved registers |
| BSPSTORE | | Unmodified | | Not used for IA-32 execution; Itanium architecture preserved registers |
| RNAT | | Unmodified | | Not used for IA-32 execution; Itanium architecture preserved registers |
| CCV | | Modified | 64 | IA-32 code execution space |
| UNAT | | Unmodified | | Not used for IA-32 execution, Itanium architecture preserved |
| FPSR.sf0 | | Unmodified | | Itanium architecture numeric status and controls |
| FPSR.sf1,2,3 | | Modified | | IA-32 code execution space, modified during IA-32 execution |
| FSR | FSW, FTW, MXCSR | IA-32 state | 64 | IA-32 numeric status and tag word and streaming SIMD extension status |
| FCR | FCW, MXCSR | IA-32 state | 64 | IA-32 numeric and streaming SIMD extension control |
| FIR | FOP, FIP, FCS | IA-32 state | 64 | IA-32 x87 numeric environment opcode, code selector, and IP |
| FDR | FEA, FDS | IA-32 state | 64 | IA-32 x87 numeric environment data selector and offset |
| ITC | TSC | Shared | 64 | Shared IA-32 time stamp counter (TSC) and Itanium architecture interval timer |
| PFS | | Unmodified | | Not used for IA-32 code execution, prior EC is preserved in PFM |
| LC | | Unmodified | | Itanium architecture preserved registers |
| EC | | Unmodified | | Itanium architecture preserved registers |
| EFLAG | EFLAG | IA-32 state | 32 | IA-32 system/arithmetic flags, writes of some bits conditioned by CPL and EFLAG.iopl |
| CSD | CSD | IA-32 state | 64 | IA-32 code segment (register format) |
| SSD | SSD | IA-32 state | 64 | IA-32 stack segment (register format) |
| CFLG | CR0/CR4 | IA-32 state | 64 | IA-32 control flags; CR0 = CFLG{31:0}, CR4 = CFLG{63:32}, writeable at CPL = 0 only |
Column 3 in Table 10.4 is identified as the convention. The information in this column tells how the Itanium architecture registers are impacted by the IA-32 execution environment. Here each register or group of registers is identified as unmodified, modified, or IA-32 state. The Itanium architecture registers that are redefined to serve as IA-32 application registers are identified by “IA-32 state” in the convention column. Examples are Itanium architecture floating-point registers Fr16 through Fr31, which serve as the SIMD extension registers for IA-32 applications. Even though many of the Itanium architecture registers are not used during IA-32 execution, the contents of some of them are modified when IA-32 applications are run. For this reason, registers in column 3 that do not represent an IA-32 application register are identified with either unmodified or modified. Itanium architecture registers identified as modified are used as scratch registers and might hold data temporarily during the execution of an IA-32 application. For example, Itanium architecture general registers Gr1 through Gr3 are used as scratch registers and identified as “modified” in Table 10.4. Therefore, the values that they contain from the prior Itanium-based operating environment are not preserved, and if those values are needed, the registers’ contents must be saved and restored as part of the transition process between the Itanium-based environment and the IA-32 execution environment. Information in a register identified as “unmodified” is not affected by the IA-32 application. For instance, in Table 10.4 the RSC, BSP, BSPSTORE, and RNAT application registers are labeled as unmodified. The values they contain remain valid, and those values can be used upon return to the Itanium-based execution environment.
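Parts of the mapping in Table 10.4 are regular enough to express as small lookup helpers. The Python sketch below covers only three pieces (the eight general-purpose registers, the selector fields packed into Gr16 and Gr17, and the XMM register pairs). The dictionary and function names are hypothetical, and the selector values in the usage lines are made up for illustration.

```python
# IA-32 general-purpose register -> Itanium general register number (Table 10.4).
GPR_MAP = {"EAX": 8, "ECX": 9, "EDX": 10, "EBX": 11,
           "ESP": 12, "EBP": 13, "ESI": 14, "EDI": 15}

# IA-32 selector -> (Itanium general register number, bit offset of the 16-bit field).
SELECTOR_FIELDS = {"DS": (16, 0), "ES": (16, 16), "FS": (16, 32), "GS": (16, 48),
                   "CS": (17, 0), "SS": (17, 16), "LDT": (17, 32), "TSS": (17, 48)}


def read_selector(name: str, gr: dict) -> int:
    """Extract a 16-bit selector from the 64-bit value held in Gr16 or Gr17."""
    reg, shift = SELECTOR_FIELDS[name]
    return (gr[reg] >> shift) & 0xFFFF


def xmm_registers(n: int) -> tuple:
    """Return the (low half, high half) floating-point register numbers holding XMMn."""
    if not 0 <= n <= 7:
        raise ValueError("XMM register number must be 0..7")
    return (16 + 2 * n, 17 + 2 * n)


# Hypothetical register file contents.
gr = {16: 0x0038_0030_0028_0023, 17: 0x0040_0000_0010_001B}
ds = read_selector("DS", gr)
ss = read_selector("SS", gr)
```

For example, xmm_registers(3) yields the pair (22, 23), which corresponds to the Fr22–23 row for XMM3.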
Chapter 11 - Compiler Technology for the Itanium-based Applications Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Chapter 11: Compiler Technology for the Itanium-based Applications

The Itanium architecture was specifically designed with compilers in mind. Compilers are a critical element in implementing high-performance application code. They identify and exploit opportunities for parallel execution in the application code, revise and reorder code for parallel execution, and communicate information via hints and instructions to the processor to facilitate high performance. The process by which the compiler modifies application code so that it performs better is known as optimization.

Many vendors offer Itanium compilers for different programming languages and operating systems. This chapter explains the internals of the Intel® compilers. This explanation serves as an overview of the things a compiler for the Itanium processor can be expected to do. Designs for compilers from other companies will undoubtedly vary, but the concepts will be similar, as will many of the solutions. Information on the Intel compilers can be found at www.intel.com and developer.intel.com/vtune on the web.
ARCHITECTURE OF THE INTEL COMPILER

Intel provides C++ and Fortran compilers for the Itanium processor. These compilers utilize the advanced features of the Itanium architecture to maximize application performance by:

Decreasing overhead in memory accesses.
Reducing the number of branches and the penalty overhead associated with branches.
Increasing instruction level parallelism.

Memory access overhead is decreased through speculation and the use of the large number of registers. Delays due to branching are reduced by the use of predication. Instruction level parallelism is increased by the scheduling techniques that consider large amounts of a program during code scheduling.

Familiar optimizations used widely in other compilers have been extended in the Itanium compiler to employ the features and resources of the architecture. Furthermore, new optimization techniques have been developed and used in the compiler to fully extract the parallelism of the architecture. Figure 11.1 outlines the architecture of the Intel compilers.
Figure 11.1: Compilation Process
PROFILE OF AN APPLICATION PROGRAM

A compiler can take better advantage of the architectural features of the Itanium processor in an application if it understands the execution behavior of the program. Information about key execution properties of a program can be collected by instrumenting, then running, the program. Of special interest is information about which functions are called and how often each branch is taken. This collected information is referred to as the profile of the program and is used by the compiler to make better decisions about how to optimize the program.

The first step in the compilation process (see Figure 11.1) is profiling of the application program. The Intel compilers support both static and dynamic profiling. As its name implies, static profiling means that the profile of the execution of the program is estimated without actually running the program. When the compiler is used with static profiling, it estimates the frequency of execution of the functions and the probability that branches are taken based on known characteristics of a typical program, not the specific application. A dynamic profile is generally more accurate because it is based on information collected during runs of the application program, and it therefore reflects the actual operation of the program.

All compilers support static profiling. Many compilers have supported dynamic profiling for other processors, but none have had an architecture designed to take advantage of dynamic profile information the way the Itanium architecture can. The software developer makes the choice of static or dynamic profiling at compile time. Over time, as familiarity grows and additional tool support becomes available, dynamic profiling will more commonly become the preferred method of compilation.
DYNAMIC PROFILE-GUIDED COMPILATION

Dynamic profiling, also known as profile feedback, is a method for gathering dynamic program control flow information about the operation of an application program. Dynamic profile-guided compilation is a three-step process as illustrated in Figure 11.2.

The first step is instrumented compilation. This operation corresponds to an initial compile of the application code during which it is prepared to collect profile information. The output of this step is a modified version of the application program called instrumented code.

Next the instrumented code is run, one or more times, with typical input data sets to gather profile information about its operation. This step of the profile process is instrumented execution. The collected profile results are output as files.

Figure 11.2: Dynamic Profile-guided Compilation

The final step, feedback compilation, is used to combine the profiles and annotate the application program with this information. As the compiler continues to process the program, the optimizing routines read this profile information in the application program and use it to guide the optimization process.
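The combining of profiles in the feedback step can be pictured with a small model. The following Python sketch is only an illustration, not the Intel compilers' file format or algorithm: the function name `merge_profiles`, the branch labels, and the counts are all hypothetical. It sums the taken and not-taken counts recorded by several instrumented runs and derives a taken probability for each branch.

```python
from collections import Counter

def merge_profiles(runs):
    """Sum per-branch (taken, not_taken) counts over several instrumented runs
    and return the estimated probability that each branch is taken."""
    taken, total = Counter(), Counter()
    for run in runs:
        for branch, (t, n) in run.items():
            taken[branch] += t
            total[branch] += t + n
    # Branches that never executed carry no information and are skipped.
    return {b: taken[b] / total[b] for b in total if total[b]}

# Hypothetical counts from two runs with different input data sets.
runs = [
    {"loop_exit": (1, 99), "error_check": (0, 100)},
    {"loop_exit": (1, 49), "error_check": (2, 48)},
]
probabilities = merge_profiles(runs)
```

Both branches in this made-up data come out as rarely taken, which is the kind of information the later optimization phases can use when one path must be favored over another.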
INTERPROCEDURAL ANALYSIS AND OPTIMIZATION

The second step in the compilation process of Figure 11.1 is interprocedural analysis and optimization. For the most part, compilers have operated on one function at a time. To fully exploit the architectural features of the Itanium processor, the compiler must gather information about the behavior of multiple functions over a large region of application code. The Intel compiler's interprocedural optimizer has the ability to perform whole-program analysis. Analysis of the whole program exposes opportunities to optimize across functions in the complete program. In this way, the compiler receives the most comprehensive information with which to optimize the program.

A characteristic of a good optimizer is the ability to accurately predict which pointers in a program can point to the same memory locations. Loads that definitely point to the same location can be reduced to a single load and the value can be held in a register, something a large register set helps make possible. Loads and stores that definitely do not point to the same memory can be freely rearranged to maximize parallelism. Loads and stores that may occasionally point to the same memory normally inhibit code motion and thereby limit instruction level parallelism. The availability of the many forms of speculative loads allows the application programmer to overcome this significant barrier to performance.

The Intel compilers use interprocedural analysis to extend what the compiler is considering when computing information about pointers. In other words, the compiler examines what functions do with memory, and exploits the information when compiling the rest of the program. Thus, values loaded from memory can remain in registers across function calls when it is known that the function called will not change the memory from which the register data came. Optimizations such as this are only possible when the compiler has computed interprocedural information.
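The idea of keeping a loaded value in a register across a call can be sketched with a very small "may modify" analysis. This Python model is an assumption-laden illustration rather than the Intel compilers' analysis: memory is reduced to named locations, and the names `transitive_mod_sets`, `can_keep_in_register`, and the example call graph are hypothetical.

```python
def transitive_mod_sets(direct_mods, calls):
    """Propagate 'may modify' sets up the call graph until nothing changes.

    direct_mods: function -> locations it writes itself
    calls:       function -> functions it calls (every function is a key)
    """
    mods = {f: set(direct_mods.get(f, ())) for f in calls}
    changed = True
    while changed:
        changed = False
        for caller, callees in calls.items():
            for callee in callees:
                extra = mods[callee] - mods[caller]
                if extra:
                    mods[caller] |= extra
                    changed = True
    return mods

def can_keep_in_register(location, callee, mods):
    """A value loaded from `location` survives a call to `callee`
    if the callee (and everything it calls) never writes that location."""
    return location not in mods[callee]

# Hypothetical program: main calls log and update; update calls bump.
calls = {"main": ["log", "update"], "update": ["bump"], "bump": [], "log": []}
direct_mods = {"bump": {"counter"}, "log": {"log_buffer"}}
mods = transitive_mod_sets(direct_mods, calls)
```

In this model a value loaded from `counter` can stay in a register across the call to `log` but must be reloaded after the call to `update`, because `update` reaches `bump`, which writes `counter`.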
MEMORY OPTIMIZATION, PARALLELIZATION, AND VECTORIZATION

The Intel compiler uses a variety of memory optimization, parallelization, and vectorization techniques to reduce the impact of memory latency on the performance of the compiled program. The compiler focuses on eliminating memory accesses by holding values in registers and cache, and by getting them loaded into registers or cache in advance so as to minimize any delays in accessing the data when it is needed.

The compiler's cache optimizations are focused on making the code effectively use the available cache memory. For instance, the compiler may exploit temporal locality by using the level 1 load hints on frequently used memory access instructions. Correct use of this hint completer allows data references to incur much smaller overhead. The compiler also uses techniques specifically designed for loops, such as linear loop transformations, loop fusion, loop distribution, and loop-block-unroll-and-jam, to restructure loop operations to run faster on the Itanium architecture.

The Intel compilers perform vectorization and parallelization optimizations. Loop vectorization techniques improve the performance of floating-point applications. For instance, the number of memory references can be reduced by a factor of two by using load-pair instructions. Multiply and accumulate operations that are common to vector computations can be performed efficiently with the fused multiply-add instruction. The compilers support OpenMP, which is an industry standard for specifying shared memory parallelism.
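Loop fusion, one of the loop techniques named above, can be illustrated with a toy model. The Python sketch below is hypothetical: the functions `unfused` and `fused` and the way loads are counted are assumptions made for illustration, and the counts are a model rather than measurements. Fusing the two loops removes the intermediate array, so each element is touched once, and the fused body has the multiply-then-add shape that a fused multiply-add instruction could cover.

```python
def unfused(a, b):
    """Two separate loops with an intermediate array held in memory."""
    loads = 0
    scaled = []
    for i in range(len(a)):
        scaled.append(a[i] * 2.0)       # reads a[i]
        loads += 1
    out = []
    for i in range(len(a)):
        out.append(scaled[i] + b[i])    # reads scaled[i] and b[i]
        loads += 2
    return out, loads

def fused(a, b):
    """One loop; the scaled value never leaves a register in this model."""
    loads = 0
    out = []
    for i in range(len(a)):
        out.append(a[i] * 2.0 + b[i])   # reads a[i] and b[i]
        loads += 2
    return out, loads

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
result_unfused, loads_unfused = unfused(a, b)
result_fused, loads_fused = fused(a, b)
```

Both versions compute the same values; only the modeled number of memory reads differs.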
GLOBAL SCALAR OPTIMIZATION

The next step, shown in Figure 11.1, is global scalar optimization. The objective of this optimization technique is to improve the performance of an application program by minimizing the number of computations and the number of memory references. The techniques used for scalar optimization in the Intel compiler are partial redundancy elimination and partial dead store elimination.

Partial redundancy elimination has the ability to remove redundancies that apply globally (across the whole application program) as well as those that occur only on some control paths. It permits removal of fully redundant and partially redundant elements of code and movement of loop invariant code elements outside of the loop. For example, the compiler removes a fully redundant element of code that performs a computation by saving the result produced by the code in a temporary variable. Later references to this code element in the program use the value of the temporary variable instead of repeating the computation.

Traditional partial redundancy elimination is extended in the compiler for the Itanium architecture with control and data speculation. Including speculation capability improves its ability to remove redundant loads. Use of control speculation permits removal of a redundant load on one control path, perhaps at the expense of another, less important control flow path. Removal of a redundant load can sometimes be inhibited by an intervening store. If the compiler can determine that there is an unknown, but small, probability that the load instruction and subsequent store instruction may access the same memory location, data speculation may be used to remove the redundant load. In both cases, a check operation must replace the redundant load to assure that its removal does not produce an error.

In contrast to partial redundancy elimination, which removes redundant loads, partial dead store elimination removes redundant stores. Knowing that a value does not need to be written to memory allows the compiler to eliminate the store to memory, resulting in improved performance.
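The simplest case described above, a fully redundant computation replaced by a reference to a saved value, can be sketched for a single straight-line block. This Python model is a deliberately reduced illustration, not partial redundancy elimination itself: it ignores control flow, partial redundancy, and speculation, and the instruction tuples and the name `eliminate_redundant` are hypothetical.

```python
def eliminate_redundant(block):
    """Replace a recomputation of an already available expression with a copy.

    Each instruction is (dest, op, src1, src2). Only a single straight-line
    block is handled; expressions are matched by identical operator and operands.
    """
    available = {}   # (op, src1, src2) -> variable currently holding that value
    result = []
    for dest, op, a, b in block:
        key = (op, a, b)
        if key in available:
            result.append((dest, "copy", available[key], None))
        else:
            result.append((dest, op, a, b))
        # Writing dest invalidates expressions that read dest or are held in dest.
        available = {k: v for k, v in available.items()
                     if dest not in (k[1], k[2]) and v != dest}
        # Record the expression unless dest overwrote one of its own operands.
        if key not in available and dest not in (a, b):
            available[key] = dest
    return result

block = [
    ("t1", "add", "x", "y"),
    ("t2", "mul", "t1", "z"),
    ("t3", "add", "x", "y"),    # same value as t1: fully redundant
    ("x",  "add", "x", "one"),  # redefines x, so x + y is no longer available
    ("t4", "add", "x", "y"),    # must be recomputed
]
optimized = eliminate_redundant(block)
```

The third instruction becomes a copy of `t1`, while the last one is kept because `x` changed in between.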
CODE GENERATION AND SCHEDULING

The final step (see Figure 11.1) in compiling a program is to pick the actual instructions for the machine to execute. Since the Itanium architecture relies on the compiler to express parallelism in the code it generates, the process of selecting instructions to execute in parallel is critical. The part of a compiler that selects the instructions to be run in parallel is called the code scheduler.

There are other critical steps in selecting the right code to schedule. These other steps are concerned with using features like predication, speculation, parallel compares, and rotating registers. A common theme in the code generator is determining the most likely paths through a program and emphasizing the optimization of these paths when choices must be made that favor one path over another. The path determination process involves the formation of key regions to optimize. We will mention region formation as it pertains to several of the optimizations.

The code generator consists of multiple phases occurring in the order shown in Figure 11.3. The first phase, translation, converts the optimizer's intermediate representation of the program into the code generator's intermediate representation. Translation is needed because the intermediate representation that is best suited for the optimizer differs from that which is best suited for the code generator. It turns out to be best to use different intermediate representations of the program being compiled and to spend a little compilation time translating between the stages.
Figure 11.3: Code Generator Compilation Order

Predicate region formation, if-conversion, and compare generation occur in the predication phase. The code generator contains two schedulers: the software pipeliner for loops and the global code scheduler for everything else. Both schedulers make heavy use of control and data speculation. The software pipeliner also utilizes rotating registers, predication, and loop branches to generate efficient schedules for integer as well as floating-point loops.

The code generator's register allocator must handle several issues specific to the Itanium architecture. These include NaT bit maintenance during spill/fill, ALAT awareness for data speculative registers, correct rotating register allocation for the software pipeliner, and predicate awareness. The remainder of this chapter describes each of the major code generator phases in more detail.
Software Pipeliner

Software pipelining is an important, but very specialized, optimization phase that considers only regions of code known as loops. Software pipelining improves the performance of a loop by overlapping the execution of several iterations. The Itanium architecture provides extensive support for software-pipelined loops, such as register rotation, and special loop branches and registers. These features enable efficient software pipelining of loops, without the accompanying increase in code size seen in other architectures.
The software pipeliner schedules instructions in regions made up exclusively of the set of basic blocks in a loop. If the loop has multiple basic blocks, it is converted into a loop consisting of a single basic block using if-conversion as described in the following section on predication. The software pipeliner in the code generator utilizes advanced techniques known as data speculation, back substitution, and riffling to enable pipelining of many loops.

Chapter 7 covers software pipelining in more detail. The Intel compilers provide software-pipelining support in the code generator to take advantage of the many opportunities the architecture provides.
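The overlap of iterations can be visualized with a small scheduling model. The Python sketch below is an illustration under simplifying assumptions that are not taken from the text: every stage takes one cycle, a new iteration starts every cycle, and the stage names and the function `pipelined_schedule` are hypothetical. Under these assumptions a loop of N iterations with S stages finishes in N + S - 1 cycles instead of N x S.

```python
def pipelined_schedule(iterations, stages):
    """Return, for each cycle, the (iteration, stage) pairs active in that cycle.

    Assumes one-cycle stages and one new iteration started per cycle.
    """
    cycles = iterations + len(stages) - 1
    schedule = []
    for cycle in range(cycles):
        active = []
        for s, name in enumerate(stages):
            it = cycle - s          # iteration currently occupying stage s
            if 0 <= it < iterations:
                active.append((it, name))
        schedule.append(active)
    return schedule

stages = ["load", "add", "store"]
schedule = pipelined_schedule(5, stages)
sequential_cycles = 5 * len(stages)
```

The first cycles, in which the pipeline fills, correspond to the prolog; the middle cycles, with every stage busy, to the kernel; and the final draining cycles to the epilog.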
Instruction Predication

Predication is one of the key features of the Itanium architecture. Using predication, the compiler is able to merge the execution of multiple paths. This action increases instruction-level parallelism by removing the penalty of mispredicted branches and nonsequential control flow in pipelined regions. Predication increases code motion freedom by allowing instructions to be moved upward across branches in a non-speculative manner and to be pushed downward into subsequent join blocks.

The predicator decides which conditional branches to convert to use predicates. Generally, such branches are caused by "if" statements. The predicator first forms predicate regions; if-conversion is then done within each predicate region. The goal of region formation and if-conversion is to select a predicate region where the total number of static branches within the predicate region will be reduced after if-conversion, thereby reducing performance lost through branching.

When only static profile information is available, the algorithm for selection of code for the predicate region focuses on the availability of processor resources and the compatibility of individual critical paths. The algorithm avoids including basic blocks in the predicate region if they cause processor resource over-subscription or if they significantly increase the critical path through the region. With dynamic profile feedback, the selection criteria are extended to include the cost of branch misprediction and the weight of individual critical paths. The algorithm focuses on the branches that produce the most misprediction penalties and chooses the surrounding blocks to form a predicate region.

Path collapsing merges the control flow paths within the predicate region into a minimal set of control flow paths. The Intel compilers consider heavy use of many predicates, and then capitalize on opportunities to minimize the number of different predicate values that are actually computed in the resulting program. Using predicates incorrectly (excessively) can easily degrade performance instead of improving it.

As shown in Figure 11.4, the compiler reduces the total number of control-flow paths from three to two using predication. Two tasks are accomplished by merging the control-flow paths. First, the unbiased conditional branch is eliminated, resulting in a highly biased conditional branch. Second, a larger basic block is formed from otherwise small basic blocks. This formation offers more opportunity for ILP to fill up the issue bandwidth and to hide long latency instructions.
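The effect of if-conversion on a single "if" statement can be modeled in a few lines. The Python sketch below is only a conceptual model, not generated Itanium code: the predicate names `p1` and `p2` and both function names are hypothetical, and the inner `if pred` stands in for the hardware discarding the result of an instruction whose qualifying predicate is false.

```python
def with_branch(a, b):
    """Original form: one of two paths is chosen by a conditional branch."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r

def if_converted(a, b):
    """If-converted form: one compare produces two complementary predicates,
    both arms are issued, and each takes effect only under its predicate."""
    p1 = a > b
    p2 = not p1
    r = None
    for pred, value in ((p1, a - b), (p2, b - a)):
        if pred:        # models the qualifying predicate on the instruction
            r = value
    return r

pairs = [(5, 3), (3, 5), (4, 4), (-2, 7)]
results = [(with_branch(a, b), if_converted(a, b)) for a, b in pairs]
```

Both forms return the same value for every input; the converted form simply has no control-flow split between the two arms, which is what allows them to share one larger basic block.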
Figure 11.4: Impact of Predicated Execution on Branch Prediction

The code generator contains a relational database known as the predicate query system that is accessed by later phases of the code generator. It contains information like predicate disjointness, predicate dominance and post-dominance, predicate promotion, and predicate addition and subtraction. Without accurate predicate information, scheduling, software pipelining, and register allocation are forced to make conservative decisions that would result in sub-optimal code.
Global Code Scheduler

The global code scheduler schedules code not scheduled by the software pipeliner. The global code scheduler has been designed to exploit the architectural features of the Itanium processor while being sensitive to practical considerations such as code size and compile time.
Compensation Code

Compensation code is duplicated code that must be inserted when moving code up across control flow joins or down below control flow splits. Generating it is a difficult and critical step in generating code for the Itanium architecture. The algorithm utilized by the Intel compilers uses a method known as wavefront scheduling and defers the generation of compensation code until other code has been scheduled.

This scheduling method is called wavefront scheduling because we visualize the instructions to be scheduled as being partitioned between scheduled and unscheduled code, with the wavefront defining the separation. Assuming top-down scheduling, the nodes above the wavefront have already been scheduled, and the schedule in these nodes will not be changed, unless one resorts to backtracking. No code has been scheduled into any node below the wavefront.

In a top-down scheduling scheme, the wavefront first passes through the region entry blocks. When code scheduling into a block is completed, the block is declared closed, and the wavefront is advanced across it. This block now lies above the wavefront and represents a fully scheduled block. Thus, the wavefront advances down the region until it finally passes through all the exit nodes in the region. When a block B is closed, any unscheduled instructions from B, or above, are implicitly moved below B to be scheduled later. Hence, before declaring a block closed, the scheduler examines both the correctness and the profitability of such downward code motion.

When performing an upward or downward code motion that requires code duplication, the global code scheduler postpones the generation of this compensation code until it is actually scheduled. In so doing, the global code scheduler provides scheduling freedom for the compensation code since its destination block is not fixed a priori. This practice also ensures that only one copy of the instruction is executed on any path through the region.
Speculation and Predication

The speculation support in the Intel compiler allows many dependencies to be ignored when considering which instructions to schedule. Control dependencies and data dependencies, which normally hinder attempts to find instruction level parallelism in a program, can be broken using speculation. Code is scheduled in an order based on a cost-benefit analysis. This analysis takes into consideration, among other things, an instruction's global critical path length, its resource requirements, and its speculation and code duplication costs. Speculation-check instructions and recovery code are generated as a byproduct of speculation. Load safety information is used to avoid unnecessary control speculation.

The global code scheduler uses predication support in the Itanium architecture to convert a control speculative instruction to a non-speculative one. Sometimes an instruction being moved across a branch is scheduled after the compare operation that controls the branch. If the predicate produced by the compare is available for use, the instruction is predicated. This practice eliminates the need for a check instruction and recovery code. In addition, when the predicate is false, the adverse effects of a speculative operation on the data cache and TLB are avoided.

Instruction predication, unpredication, and predicate promotion are achieved on the fly while scheduling. When an instruction is moved across multiple branches, the predicate controlling execution of the block of origin may not be available because the compare generating the predicate has not been scheduled or the compare's latency has not yet expired. However, a predicate for an intermediate block in the control dependence chain may be available. Predicating the instruction with such a predicate does not make it non-speculative, but it does reduce its speculativeness, because the instruction then executes only when that qualifying predicate is True. To do this effectively, the global code scheduler maintains a predicate promotion list and keeps track of predicates that become available as compare instructions are scheduled. This list is essentially a form of control dependence information, but it resides in the predicate domain.

The predicate promotion list is also used when control speculating predicated instructions. The global code scheduler speculates these instructions by promoting the qualifying predicate to an available predicate in the predicate promotion list. If no predicate is available, the instruction is unpredicated.
Downward Code Motion

Doing only upward code motion is not sufficient. The global code scheduler benefits from downward code motion in several ways. First, operations that cannot be speculated, such as stores and speculation check instructions, tend to stay in their block of origin. (They are not moved across branches.) It is advantageous to move these instructions downward if they do not fit into the block's schedule. Second, downward motion can be used to empty a block, which can eliminate an unconditional branch or expose an opportunity for multi-way branch generation. To do this effectively, the global code scheduler monitors and updates block layout during scheduling. Finally, downward code motion helps reduce the amount of speculation or compensation code needed to expose instruction level parallelism.

To move stores down to a join node, those stores must be predicated. Predicates are only available for use in blocks dominated by the compare instructions that generate them. This requirement places a limit on downward code motion.
Global Register Allocation

In general, the compilers postpone decisions about which registers to use until code scheduling has been completed. Thus, the final step in generating the code is to figure out which registers to use. The global register allocator (GRA) in the Intel compiler is a region-based Chaitin/Briggs style graph-coloring scheme. Special consideration for Itanium architecture features must be taken into account in the design:

Advanced Loads: To minimize ALAT set conflicts, the register allocator must be aware of the mapping between registers and ALAT entries. Knowing this mapping, it is possible to allocate two registers within the same class (static, stack, or rotating) so as not to create a conflict. For maximum flexibility, the advanced load targets are allocated first.

NaT Bits: Each general register has an associated NaT bit. When the register is spilled, the NaT bit is stored in the UNAT application register. The bit location is determined by the spill address. To obey this association, the register allocator spills those registers that might contain speculated values into contiguous memory. In addition, since the UNAT is a 64-bit register, after 64 registers are spilled, the UNAT itself must be spilled. This mechanism requires extra bookkeeping for the register allocator.

Rotating Registers: Traditional register allocators do not support register rotation. However, live ranges that span multiple stages of pipelined loops can benefit from the use of rotating registers. These live ranges are allocated by a special allocator within the software pipeliner. Then, the remaining live ranges are allocated by the global register allocator.

Predication: In the presence of predication, "liveness" of a virtual register can be a function of multiple predicates, since the qualifying predicate of an instruction guards its definitions and uses. Traditionally, liveness is represented as a bit-vector. However, in our predicate-aware register allocator, liveness is represented as a bit-matrix with predicates being the additional dimension.
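The NaT bit bookkeeping described above can be made concrete with a short calculation. The Python sketch below assumes, based on the architecture's definition of the 8-byte register spill as best understood here rather than on anything stated in this chapter, that the UNAT bit index is taken from bits 8:3 of the spill address. The function name `unat_bit` and the base address are hypothetical.

```python
def unat_bit(spill_address):
    """UNAT bit that receives the NaT bit for an 8-byte spill at this address.

    Assumption: the index is bits 8:3 of the spill address.
    """
    return (spill_address >> 3) & 0x3F

base = 0x1000   # hypothetical start of a contiguous spill area
bits = [unat_bit(base + 8 * i) for i in range(64)]
wraps_to = unat_bit(base + 8 * 64)
```

Sixty-four contiguous 8-byte spill slots map to sixty-four distinct UNAT bits, and the sixty-fifth slot maps back to the bit used by the first. That wrap-around is why the allocator spills into contiguous memory and must save the UNAT itself after every 64 spills.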
Index Itanium Architecture for Software Developers by Walter Triebel Intel Press © 2000
Index A absolute difference instructions, 173, 187–188 absolute value instructions, 235–236 acquire semantics, 73, 89 add instructions fetch and add immediate, 88, 90–91 fixed-point multiply add, 254–255 floating-point, 237–239, 242 floating-point multiply add, 240–242, 253–254 floating-point negative multiply add, 240–242 integer, 81, 102–104 parallel, 172–176 parallel shift left/right and add, 173, 178–180 predicated, 124 shift left and add, 102, 105 add operations modulo, 173–174, 175–176 pseudo, 103–104 saturated, 173–176, 178 add pointer, 109, 110 shift left and add pointer, 109, 111–112 add pointer instructions, 109–112 address pointers, 32 address space, virtual, 22 advanced load, 13, 14, 67, 68–69, 85 ambiguous memory dependencies and, 71–72 EPIC, xxxiii GRA and, 295 advanced load check, 14, 86
ALAT (advanced load address table), 13–14 data speculation and, 71–72, 85 GRA and, 295 algorithms, 242–247, 248, 256 web addresses, 244, 248, 256 allocate stack frame operations, 124, 145–146, 152 ambiguous memory dependencies, 57–58 load/store instructions, 71, 83, 86 AND complement instructions floating-point, 256–257 integer, 106, 107 AND instructions floating-point logical, 224, 225, 256–258 integer, 81, 105–106 AND-type comparisons, 129, 268 ANSI/IEEE Standard for Binary Floating-point Arithmetic, 203 application memory addressing model, 22–27 application registers, 29–31, 40–47, 55 execution unit types, 98 IA-32 mapping, 278–282 arithmetic instructions floating-point, 234–254, 267, 287 integer, 81, 102–104 parallel, 172–190 assembly language instructions, 53–56 atomic load, 88 average instructions parallel, 176–178 parallel subtract, 173, 181–182
Index B base-address-register update addressing, 69, 77 biased exponent in floating-point numbers, 203 biased load, 68–69 big-endian memory byte order, 24, 26–27, 43, 47–48 branch barrier condition, 79–83 branch cache deallocation hint, 137, 138 branch conditions, 135 branch instructions, 54, 134–137 branch hints, 137–138 branch return, 152 call branch, 135 conditional branch, 135, 136–137 conditional function call, 151 counted branch, 124, 135 loop, 160 parallel compares and, 142–144 set branch, 275, 276–277 branch-mispredict delays, 8–9 branch predication, 1–3, 8–10, 15, 36–37, 139–142, 295 code generation, 2, 290, 291–292 EPIC, xxviii–xxxv GRA and, 296 problems in, 15–16, 291 speculation and, 293–294 store instructions and, 295 branch registers, 29–30, 37–38, 55 aligned data and, 38 IP register and, 38
loop operations, 156–168 branch whether hint, 137 broadcast permutations, 193–195 BSP registers, 41, 44, See also application registers function call mechanisms, 149–150 BSPSTORE registers, 41, 44, See also application registers function call mechanisms, 149–150 bundles, instruction, 23–24, 59 format, 59, 60 syllables, 64 templates, 59–64 bytes, 19–27, 32–33
Index C cache memory, 6–7, 73–74 branch cache deallocation hint, 137, 138 optimization, 287 call branch instructions, 135 CCV register, 41, 45, See also application registers Celeron processors, 11 CFM registers, 29–30, 38–40 PFS registers and, 46, 154–156 register renaming and, 40 register stack frames, 39–40, 145, 154–156 rotating registers and, 40 character string instructions, 54, 119–120 check load, 14, 68–69, 72, 84, 85 CISC (Complex Instruction Set Computing) processors, xxiv–xxvii, 3 clearing memory counted loop program structure, 160 code generation/scheduling, 288–296 compare and exchange instruction, 88, 89–90 compare and exchange value (CCV) register, See CCV registers compare instructions, 54, 124–125 compare, 124–131 compare and exchange, 88, 89–90 compare word, 124, 131–132 floating-point, 267–270 parallel, 15–16, 142–144, 173, 182–184 predication and, 140 test bit, 124, 132–134 test NaT bit, 124, 132, 133–134
compare operations AND-type, 129, 268 normal, 127 parallel, 15–16, 142–144, 182–184 pseudo, 130–131 unconditional, 127 compensation code, 293 compilation, 2, 53 architecture, 283–284 communication hardware, xxviii, 2, 53, 59–61 dynamic profile-guided, 285–286 EPIC, xxx–xxxi instrumented, 285–286 optimization and, 283, 284, 286–288 process, 284–296 speculation, 2, 293–294 web addresses, 283 compiler, xxviii, 2, 283–296 computational model control definitions, 212 compute zero index instruction, 119–120 conditional branch instructions, 135, 136–137 branch return, 152 conditional function call, 151 conditional relations, See predicate registers; predication consumers, 87 control dependencies, 56–57, 79–80 control speculation, 13, 79–83 conversions of numbers decimal to floating-point, 204–206 integer/floating point, 223, 232–234 counted branch instructions, 124, 135 CPUID registers, 29–30, 31, 41, 48–50, 55, See also application registers current frame marker (CFM) registers, See CFM registers
Index D data, 19, 20, 169–171 aligned/unaligned, 23–24, 38 conversion, 204–206, 223, 232–234 floating-point, 203–207, 216–221, 229 formats, 20–22, 24–27 integer, 19–22, 223, 232–234 multimedia, 169–171 parallel, 171 parallel floating-point, 220–221 signed/unsigned, 20, 232–234 sign-extended, 21–22 sizes, 19–21, 22–24 zero-extended, 21 data arrangement instructions floating-point, 259–267 parallel, 193–202 data dependencies, 56–58, 83, 84 data speculation, 13 advanced load instructions and, 71–72 check load instructions and, 72 routine, 87 store barrier condition, 83–87 data warehousing, 22 decimal to floating-point conversions, 204–206 delays branch-mispredict, 8–9 memory latency, 12–13, 78–80 delay slots, 8–9
DeMorgan comparisons, 129, 268 dependencies, 11, 56–59, 82–84 instruction groups, 56–58, 60 memory access instructions, 78–79 deposit instructions, 112, 116, 117–119 divide algorithms, 242–247, 256 double-extended precision numbers, See precision numbers double precision numbers, See precision numbers double words, 19–21, 22–23, 32–33 downward code motion, 294–295 dynamic execution, 11 dynamic profile-guided compilation, 285–286 dynamic profiling, 285–286, 291
Index E e-Business and memory requirements, 22 EC register, 41, 45–46, See also application registers PFS register and, 46, 154–156 software pipelining, 160–162 EPIC (explicitly parallel instruction computing) architecture, xxiii–xxxv, 1–16 epilog counter (EC) registers, See EC registers epilog execution phase, xxxiv, 16, 164 exchange instruction, 88–89 compare and exchange, 88, 89–90 exclusive OR instruction, 105–107 execution, 61 dynamic, 11 instrumented, 285–286 parallel, xxxii, 10–12 execution cycles, 165–167 execution phases, 16, 164–165 execution unit types application registers, 98 instruction types and, 61 explicit integer, 205 explicit parallelism, xxx–xxxi, xxxv, 1, 2 extend size, 109 extract instruction, 112, 116–117
Index F false dependencies, 11 faults illegal operation, 97 register NaT consumption, 71 unaligned data reference, 89 feedback compilation, 285–286 fence semantics, 73 fetch and add immediate instructions, 88, 90–91 fill instructions, 75, 295 floating-point, 225, 227 fixed-point multiply add instructions, 254–255 fixed shift-and-mask operation, See deposit instructions; extract instruction flexibility and parallelism, 3 floating-point arithmetic instructions absolute maximum, 237 absolute minimum, 237, 238 absolute value, 235–236 add, 237–239, 242 fused multiply-add, 240–242, 253–254, 287 fused multiply-subtract, 240–242 maximum, 237, 239–240 minimum, 237 multiply, 237, 238 multiply add, 240–242, 253–254, 287 multiply subtract, 240–242 negate, 235–236 negate absolute value, 235–236
negative multiply, 237, 238, 240, 241 negative multiply add, 240–242 normalize, 235–236 one source operand register, 234–236 reciprocal approximation, 242–244, 267 reciprocal square root approximation, 242–244, 267 subtract, 237, 238, 242 three source operand registers, 240–242 two source operand registers, 237–240 floating-point check/clear flags instructions, 213, 215 floating-point class instructions, 267, 268, 270–271 floating-point compare/class instructions class, 267, 268, 270–271 compare, 267–269 parallel compare, 267–270 floating-point data arrangement instructions merge, 259–261 pack, 259, 263–264 parallel mix, 259, 262–263 parallel merge, 259–261 sign extend, 259, 265–267 swap, 259, 264–265 floating-point data types, 203–207, 216–221, 229, 234 conversion, 223, 232–234 floating-point divide algorithms, 242–244, 256 optimized, 244–247 floating-point exclusive OR instructions, 256–257 floating-point fill instructions, 225, 227 floating-point integer instructions fixed-point multiply add, 254–255 floating-point multiply, 255 floating-point load instructions, 221–228 floating-point load-pair instructions, 221–224, 229 floating-point logic instructions AND complement, 256–257 exclusive OR, 256–257 logical AND, 224, 225, 256–258 logical OR, 256–257 select, 256–257, 258 floating-point maximum instructions, 237, 239–240 floating-point memory access instructions load, 221–228 load-pair, 221–224, 229 store, 221–224 floating-point merge instructions, 259–261
floating-point minimum instructions, 237 floating-point multiply add instructions, 240–242, 287 optimized, 253–254 floating-point multiply instructions, 237, 238 floating-point multiply subtract instructions, 240–242 floating-point negate instructions, 235–236 floating-point negative multiply add instructions, 240–242 floating-point negative multiply instructions, 237, 238, 240, 241 floating-point normalize instructions, 235–236 floating-point operations, RISC processors, xxv floating-point pack instructions, 259, 263–264 floating-point parallel compare instructions, 267–270 floating-point parallel merge instructions, 259–261 floating-point parallel mix instructions, 259, 262–263 floating-point reciprocal approximation instructions, 242–244, 267 floating-point reciprocal square root approximation instructions, 242–244, 267 floating-point registers, 29–31, 55, 203–271, See also application registers data types in, 216–221 IA-32 applications, 278 memory-to-register transfers, 65 UM register and, 215–216 floating-point register transfer instructions get transfer value/exponent/significand, 223, 231–232 move register, 223, 229–231 set transfer value/exponent/significand, 223, 231–232 floating-point remainder algorithms, 242–244, 256 floating-point select instructions, 256–257, 258 floating-point set controls instructions, 213, 214–215 floating-point sign extend instructions, 259, 265–267 floating-point spill instructions, 225, 227 floating-point square root algorithms, 242–244, 256 optimized, 248–253 floating-point status registers (FPSR), See FPSR floating-point store instructions, 221–224 floating-point subtract instructions, 237, 238, 242 floating-point swap instructions, 259, 264–265 floating-point units, 61–62 flush register stack operations, 124, 146, 150 formats 32-bit integer/address pointer instructions, 109 application memory address space, 23 arithmetic instructions, 102 assembly language instructions, 54–55
branch instructions, 135, 136, 137–138, 151–152 character string instructions, 119 compare and exchange instructions, 88, 89 compare instructions, 124, 125–130 exchange instructions, 88 fetch and add immediate instructions, 88, 90 floating-point arithmetic instructions, 235, 237–240–242, 243 floating-point compare/class instructions, 267, 268–269, 270–271 floating-point data arrangement instructions, 259 floating-point data types, 206–207 floating-point integer instructions, 255 floating-point logic instructions, 256–257 floating-point memory access instructions, 223–224 floating-point register transfer instructions, 223 FPSR status field instructions, 213–214 function calls, 151–152 instruction bundles, 59, 60 integer data types, 21 invoke IA-32 instruction set branch operation, 277 jump to IA-64 instructions, 278 large constant generating instructions, 107 load fill, 75 load instructions, 66–70, 75 logical instructions, 106 move instructions, 92, 93, 94, 97 parallel arithmetic instructions, 173, 174, 176, 184, 185 parallel data arrangement instructions, 194 parallel shift instructions, 191 population count instructions, 119 register stack frame operations, 146 shift and bit-field instructions, 112 speculation check, 82 store instructions, 76–77, 78 store spill, 78 user mask instructions, 92, 99 formats, register application, 30 branch, 30, 38 BSP, 44 BSPSTORE, 44 CFM, 30, 39 CPUID, 49 EC, 46 floating-point, 30, 208, 216–218
FPSR, 209–212 general, 30, 32, 34, 147, 171 IP, 30 multimedia, 170 PFS, 47 PMD, 50 predicate, 30, 36, 95 RNAT, 44 RSC, 42 UM, 48, 215–216 486 processors, xxvi, xxxiv FPSR, 41, 209–213, See also application registers FPSR status field instructions floating-point check flags, 213, 215 floating-point clear flags, 213–214 floating-point set controls, 213, 214–215 frame marker and register stack frame, 38 function call mechanisms CFM registers, 38 function call process, 153–156 function support instructions, 150–153 parameter passing, 146–148 program structure, 150–153, 154 register stack frames, 38, 144–148 RSE, 148–150 static registers and, 35 F-units, 61–62 fused multiply-add instructions, 240–242 optimized, 253–254, 287 fused multiply-subtract instructions, 240–242
Index G general move instructions, 92–93 general register compare instructions, 124, 131–133 general registers, 29–30, 31–36, 38, 55–56 32-bit integer/address pointer instructions, 108–112 function call mechanisms, 144–148 IA-32 applications, 278 memory-to-register transfers, 65 get floating-point transfer value/exponent/significand instructions, 223, 231–232 global code scheduler, 292, 294 global register allocator (GRA), 295–296 global scalar optimization, 284, 288
Index H Harvard architecture, 7
Index I IA-32 compatibility, xxii, 20, 40, 41, 108–112 32-bit integer/address pointer instructions, 108–112 application invocation, 275–278 Itanium operating environment, 273–275 microarchitectures, xxxiv–xxxv IEEE-754 results, 244 IEEE real numbers, 203, 218–219 parameters, 204 IF-THEN-ELSE program structure, predication, 141–142 illegal operation fault, 97 ILP, 1, 2, 53, 292 immediate data type, 19, 20 indirect addressing, 56, 66 indirect target address, 135 instruction abbreviations, See mnemonics instruction bundle format, 59–61 instruction pointer (IP) registers, See IP registers instruction groups, 56, 58–59 bundles, 23–24, 59–64 dependencies and, 56–58, 60 execution unit types and, 61 formats, 59, 60 IP register, 31 syllables, 64 syntax, 58 templates, 59–64 instruction level parallelism (ILP), See ILP
instruction set architecture (ISA), 53–56 instruction set transition model, 275–276 instruction sets, 276–278 instruction slot fields, 59, 62 instrumented compilation, 285–286 instrumented execution, 285–286 integer, explicit, 205 integer arithmetic instructions add, 81, 102–104 shift left and add, 102, 105 subtract, 102, 104 integer computation instructions, 54, 101 32-bit integer/address pointer, 108–112 arithmetic, 102–105 large constant generating, 107–108 logic, 105–107 shift and bit-field, 112–119 integer data types, 19, 20 conversion, 223, 232–234 formats, 21 signed/unsigned numbers, 20, 232–234 sign-extended, 21–22 zero-extended, 21 integer division, 256 integer/floating-point conversion instructions floating-point to integer, 223, 232–233 integer to floating point, 223, 232, 233–234 parallel floating-point to integer, 223, 234 integer floating-point multiply instructions, 255 integer logic instructions, See logic instructions integer multiplication, 254–255 integer remainder, 256 integer units, 61, 82, 97, 101 interprocedural analysis, 284, 286–287 interval time counter (ITC) registers, See ITC registers invoke IA-32 instruction set branch operation, 275, 276–277 IP registers, 29–30, 31, 38 IP-relative target address, 135, 151 ISA, 53–56 Itanium architecture, xxiii–xxxv, 1–3, 11, 12–17 compiler, 283–296 memory view, 74 operating environments, 273–275
Itanium processor, xxx, 1, 21–22, 74 instruction groups, 56, 59–60, 64 ITC registers, 41, 45, See also application registers I-units, 61, 82, 97, 101
Index J jump to Itanium instructions, 277–278
Index K kernel execution phase, 164 kernel registers, 41, 42, See also application registers
Index L large constant generating instructions, 107–108 LC register, 41, 45–46, 158–159, 160, See also application registers software pipelining, 160–162 little-endian memory byte order (format), 24–26, 59 load instructions, 66–75 advanced, 13–14, 67–69, 71–72, 85–86, 295 check, 14, 68–69, 72, 84, 85, 86 floating-point, 221–228 floating-point load-pair, 221–224, 229 formats, 66–70 load fill, 45, 69, 74–75 load hints, 67, 69, 73–74, 88, 91 load types, 67, 68–69, 88 partial redundancy elimination, 288 predicated, 124 redundant, 288 speculation and, 78–87 load/store architecture, xxv–xxvii, 4–7 local registers, 34, 35, 39–40, 145 logic instructions AND, 81, 105–106 AND complement, 106, 107 exclusive OR, 105–107 floating point, 256–258 OR, 105–106 loop count (LC) register, See LC register loop operations counted, 45–46, 156–157, 158–160, 162–168
loop count (value), 46 nonpipelined, 157–160, 161 problems of, 16 recoding, 167–168 software-pipelined, xxxiii–xxxiv, 160–168 software pipeliner scheduler, 290 stage predicates, 37 vectorization optimization, 287 while, 45, 156–158, 162 loop program structure support registers, 41, 45–46, See also application registers
Index M maximum instructions floating-point, 237, 239–240 parallel, 173, 188–190 memory access instructions, 54, 65 compare and exchange, 88, 89–90 exchange, 88–89 fetch and add immediate, 88, 90–91 floating point, 221–229 load, 66–75 semaphore, 87–91 speculation and, 78–87 store, 75–78 memory address space, 2 memory architecture, 74 memory byte order, 22–27, 33 memory latency delays, 12–13, 78–79, 80 memory optimization, 287 memory ordering semantics, 73, 89 memory units, 61–62, 82, 97, 101 merge instructions, floating-point, 259–261 minimum instructions floating point, 237 parallel, 173, 188–190 mix instructions floating-point parallel, 259, 262–263 parallel, 194, 197–199 mix permutations, 193–195 MMX technology, 169, 172, 278
mnemonics 32-bit integer/address pointer instructions, 109 arithmetic instructions, 102 assembly language syntax, 54–55 branch instructions, 135 character string instructions, 119 compare instructions, 124 execution types, 61 floating-point arithmetic instructions, 235, 237–240–242, 243 floating-point compare/class instructions, 267, 268–269, 270–271 floating-point data arrangement instructions, 259 floating-point integer instructions, 255 floating-point logic instructions, 256–257 floating-point memory access instructions, 223–224 floating-point register transfer instructions, 223 FPSR status field instructions, 213–214 indirect register instructions, 93 instruction types, 61 invoke IA-32 instruction set branch operation, 277 jump to IA-64 instructions, 278 large constant generating instructions, 107 load instructions, 66, 68, 69 logic instructions, 106 move instructions, 92 parallel arithmetic instructions, 173, 174, 176, 184, 185 parallel data arrangement instructions, 194 parallel shift instructions, 191 population count instructions, 119 registers, 55 register stack frame operations, 146 semaphore instructions, 86, 89 shift and bit-field instructions, 112 speculation, 82 user mask instructions, 92 models application memory addressing, 22–27 cache, 74 computational, 212 instruction set transition, 275–278 modulo add operations, 173–174, 175–176 modulo-scheduled pipelined loop, 164 Moore’s law, xxxiv move instructions, 81 application register, 92, 97–98
branch register, 92, 94–95 floating-point register, 223, 229–231 general register, 92–93 indirect, 92, 93–94 instruction pointer, 92, 94 long, 107, 108 long immediate, 107–108 move predicates, 92, 95–97 move user mask, 92, 99 multimedia data structures, 169–171 multimedia instructions, 54, 172 parallel arithmetic, 172–190 parallel data arrangement, 193–202 parallel shift, 190–193 software modularity and, 53 multiplex instructions, 193–197 multiply instructions fixed-point multiply add, 254–255 floating-point, 237, 238 floating-point multiply add, 240–242, 253–254, 287 floating-point multiply subtract, 240–242 floating-point negative multiply, 237, 238, 240, 241 floating-point negative multiply add, 240–242 fused add/subtract, 240–242, 253–254, 287 integer floating-point multiply, 255 parallel, 173, 184–185 parallel multiply and shift right, 173, 184–187 multi-ported data cache design, 7 multiway branch generation, 295 M-units, 61–62, 82, 97, 101
Index N NaN, 218, 220, 244 NaT bit, 13 in compare instruction, 127 control speculation and, 81–82 in general registers, 33, 94–95 GRA and, 295 in load instructions, 71, 72, 75 in move instructions, 94–95, 96, 97 in RNAT, 41, 44–45, 150 in semaphore instructions, 89, 90 in store instructions, 76 test NaT bit instructions, 124, 132, 133–134 in UNAT, 41, 45 NaTVal bit, 13, 209, 218, 220 floating-point arithmetic instructions, 239, 240 floating-point logic instructions, 257 negative numbers, See signed/unsigned numbers nested functions program structure, 154 nonpipelined counted loop operations, 158–160 program structures, 159, 160 nonpipelined while loop operations, 157–158 program structures, 157, 158 nonspeculative load, See normal load nontemporal structure cache memory, 73–74 normal compare operations, 127 normalize instructions, 235–236 normal load, 69, 70–71
normal store, 76, 77 not a number (NaN), See NaN not a thing (NaT) bit, See NaT bit numbers conversions, 204–206, 223, 232–234 floating-point, 203–206 IEEE real, 204, 218–219 precision, 204–206 signed/unsigned, 20–21, 203, 219–220, 232–234 sign-extended, 21–22, 96, 118 zero-extended, 21, 89, 90, 99
Index O optimization of code, 141–144 cache memory, 287 compilation and, 283, 284, 286–288 floating-point divide algorithm, 244–247 floating-point fused multiply-add instructions, 253–254, 287 floating-point square root algorithm, 248–253 global scalar, 284, 288 parallelization, 287 vectorization, 287 vectorization with parallel compares, 142–144 vectorization with predication, 139–142 ordered check load, 68–69, 73 ordered load, 68–69, 73 ordered store, 76 ordinal data type, 19, 20 OR instructions exclusive, 105–107, 256–257 floating-point, 256–257 integer, 105–106 logical, 256–257 OR-type comparisons, 129, 268 output registers, 34, 35, 39–40, 145
Index P P6-microarchitecture, xxvii, xxxiv–xxxv, 11 pack instructions floating-point, 259, 263–264 parallel, 194, 199–201 parallel add instructions, 172–176 parallel arithmetic instructions add, 172–176 average, 176–178 average subtract, 173, 181–182 compare, 15–16, 142–144, 173, 182–184 maximum, 173, 188–190 minimum, 173, 188–190 multiply, 173, 184–185 multiply and shift right, 173, 184–187 shift left/right and add, 173, 178–180 subtract, 173, 180–181 sum of absolute difference, 173, 187–188 parallel average instructions, 176–178 parallel average subtract instructions, 173, 181–182 parallel compare instructions, 173, 182–184 parallel compare operations, 15–16, 182–184 EPIC, xxviii, xxxiii, xxxv optimizing code, 142–144 program structure, 142–144 parallel comparison types, 124, 129–130 parallel data, 171, 262 parallel data arrangement instructions mix, 194, 197–199
multiplex, 193–197 pack, 194, 199–201 unpack, 194, 201–202 parallel execution, xxx, 10–11 parallel floating-point data, 220–221 conversion, 234 parallelism, See also ILP EPIC, xxviii–xxxi, xxxv, 1–3 problems in, xxix parallelization optimizations, 287 parallel maximum instructions, 173, 188–190 parallel minimum instructions, 173, 188–190 parallel multiply and shift right instructions, 173, 184–187 parallel multiply instructions, 173, 184–185 parallel semantics, xxvi, 12 parallel shift left/right and add instructions, 173, 178–180 parallel shift left/right instructions, 190–193 parallel subtract instructions, 173, 180–181 parameter passing, 146–148 partial dead store elimination, 288 partial redundancy elimination, 288 Pentium processors, xxvii, xxxiv–xxxv, 9, 10–11, 274, 278 performance, 3 aligned data and, 23–24 branch predication and, 9–10 caches and, 6–7 EPIC, xxviii–xxxiv ITC registers and, 45 performance monitor data (PMD) registers, See PMD registers PFS registers, 41, 46–47, See also application registers CFM registers and, 46, 154–156 pipelined counted loop operations, 162–168 execution cycles, 165–167 execution phases, 164 program structures, 162 pipelined while loop operations, 162 pipelining, 3–4 branching and, 8–9 EPIC, xxxii floating-point approximation instructions, 242 load/store, 4–6 loops and, 16–17 parallel execution vs., xxxii, 10
software, 16 plus 1 register form, add instructions, 102, 103 PMD registers, 29–31, 41, 50–51, 55, See also application registers pointer data type, 19, 20, 22 population count instructions, 54, 119, 120–121 positive numbers, See signed/unsigned numbers precision numbers, 204–206 predicate query system, 292 predicate regions, 290, 291 predicate registers, xxxv, 29–30, 36–37, 55 branch instructions, 134–138 conditional instructions, 123–124 function calls, 150–156 predicates, 36–37, 123–124 predication, 1–3, 8–10, 15, 36–37, 295 code generation, 2, 290, 291–292 eliminating conditional branches, 139–142 EPIC, xxviii–xxxv GRA and, 296 problems in, 15–16, 291 speculation and, 293–294 store instructions and, 295 previous function state (PFS) register, See PFS registers processor identification (CPUID) registers, See CPUID registers processor status register, 47 profile feedback, See dynamic profiling profiling, 284–286 program state, 35 program structures clearing memory counted loop, 160 counted loops, 159, 160, 162 data speculation scheme, 87 function call, 150–153, 154 IF-THEN-ELSE, 141–142 nested functions, 154 nonpipelined counted loop, 159, 160 nonpipelined while loop, 157, 158 parallel compare, 142–144 pipelined counted loop, 162 search for match while loop, 158 while loops, 157, 158, 160 prolog execution phase, xxxii, 16, 164 pseudo-operations
add, 103–104 compare, 130–131 floating-point, 236
Index Q quad words, 19–21, 22–23 byte order and, 24–27 general registers, 32–33 qualifying predicates, 36, 54
Index R RAW (read-after-write dependencies), 57, 59, 82, See also ambiguous memory dependencies read hit/miss, 6 real numbers, 203, 204, 218–219 recovery routine, 14, 79, 81, 82, 85–87 redundant loads/stores, 288 register file range, double precision computation, 213 register file range, single precision computation, 213 register file transfer instructions, 54 register NaT consumption fault, 71 register rename base (rrb) instructions, 124, 155, 209 register renaming, 11, 40 registers, 29–30, 55, See also specific registers assembly language notation, 1, 3, 55–56 byte ordering in, 24–27 compilation and, 2 NaT bit, 13, 33, 41, 44–45, 94–95, 150 size of, 1, 3 speculation and, 13 registers, rotating, 3, 16–17 CFM and, 40 EPIC, xxviii, xxx, xxxiii–xxxv floating-point, 208 general, 36 GRA and, 295 predicate, 36, 37, 96 software pipelining and, 17, 295
registers, stacked, 33–35, 36, 38, See also registers, rotating registers, static floating-point, 208 general, 33–34, 35 predicate, 36, 37, 96 register stack backing store, 148 register stack configuration (RSC) register, See RSC register stack engine backing store pointer (BSP) register, See BSP registers register stack engine backing store pointer for memory stores registers, See BSPSTORE registers register stack engine NaT collection register, See RNAT registers register stack engine (RSE), See RSE register stack frames, 35, 38, 39 dirty/clean, 148 function call process, 154–156 organization, 144–146 parameter passing, 146–148 register stack switching, 53 register transfer instructions, 91, See also floating point register transfer instructions move instructions, 91–98 user mask instructions, 92, 98–100 release semantics, 73, 76, 89 reserved fields, 43 reset user mask instructions, 92, 100 RISC (Reduced Instruction Set Computing) processors, xxiv–xxvii RNAT registers, 41, 44–45, 150 rotating registers, See registers, rotating round-away-from zero method, 176–178 rounding, 176–178 control definitions in FPSR, 212 precision and, 205 round-away-from zero method, 176–178 RSC, 41, 42–43, See also application registers function call mechanisms, 149–150 RSE, 148 modes, 42–43, 150 operation, 148–150 support registers, 41, 42–45 RSE backing store pointer (BSP) register, See BSP registers RSE backing store pointer for memory stores register, See BSPSTORE register RSE NaT collection register, See RNAT
registers
Index S saturated add operations, 173–176, 178 saturation limits, 174–175 scalability, 3 sequential semantics and, 12 search for match while loop program structure, 158 segment selectors in IA-32 applications, 278 semantics, xxviii, 12, 73, 89 semaphore instructions, 87–91 sequential prefetch hint, 137–138 sequential semantics, 12 set controls instructions, 213, 214–215 set floating-point transfer value/exponent/significand instructions, 223, 231–232 set user mask instructions, 92, 99–100 shift instructions integer, 102, 105, 109, 111–119 parallel, 173, 178–187, 190–193 shift left and add instructions, 102, 105 shift left and add pointer instructions, 109, 111–112 shift left/right instructions, 112–114 shift right pair instructions, 112, 114–115 shuffle permutations, 193–196 sign, floating-point number, 203 signed/unsigned numbers, 20–21, 203, 232–234 infinity, 219–220 zero, 220 sign-extended numbers, 21–22 deposit instructions, 118
move instructions, 96 sign-extend instructions floating-point, 259, 265–267 integer, 108–109 sign field, 20–21 significand, floating-point number, 203 single instruction multiple data (SIMD) instruction technology, 169, 172, 278 single-precision number, See precision numbers size of rotating portion (SOR) parameter, 39 software modularity, 53 software pipelining loop operations and, xxxiii–xxxv, 16, 160–168 loop program structure support, 41, 45–46 rotating registers and, 17 software pipeliner scheduler, 290 special purpose registers, See application registers speculation, 1–2, 12–14 compilation and, 2, 293–294 EPIC, xxviii–xxxv load instructions and, 78–87 partial redundancy elimination and, 288 predication and, 293–294 problems in, 13–14, 294 registers and, 13 store instructions, 78 speculation check, 13, 79, 80, 81–83, 85–87 predicated, 124 speculative advanced load, 67, 68–69 speculative load, 67, 68–69, 70–71, 79–80 EPIC, xxxii–xxxiii spill instructions and, 74–75 spill instructions, 45, 78 floating point, 225, 227 speculative loads and, 74–75 stacked registers, See registers, rotating; registers, stacked stack frames, See register stack frames stage predicates, 37 static profiling, 285 static registers, See registers, static stop mechanism in instruction bundle templates, 61, 63 store barrier condition, 83, 84 store instructions, 75–78
partial dead store elimination, 288 predication and, 295 speculation, 78–87 store hints, 69, 76–77 store spill, 78 subtract instructions floating-point, 237, 238, 242 floating-point multiply subtract, 240–242 fused multiply-subtract, 240–242 integer, 102, 104 parallel, 173, 180–181 parallel average, 173, 181–182 superscalar, 10–11 swap instructions, 259, 264–265
Index T tags, 71, 72 target address, branch instructions, 135 templates, 63 compiler/processor communication and, 53 fields, 59–64 instruction bundles, 59–64 types, 62 temporal structure cache memory, 73–74 test bit instructions, 124, 132–134 test NaT bit instructions, 124, 132, 133–134 32-bit integer/address pointer instructions add pointer, 109, 110 shift left and add pointer, 109, 111–112 sign-extend, 108–109 zero-extend, 108–109 token in control speculation, 81 trip counts in loops, 16 2’s complement notation, 21
Index U UM registers, 29–30, 41, 47–48, 98–99, See also application registers byte ordering, 24 floating-point registers and, 215–216 unaligned data reference fault, 89 UNAT registers, 41, 45, 75, 295, See also application registers unconditional compare operations, 127 unordered semantics, 73 unpack instructions, 194, 201–202 upward code motion, 294 user mask (UM) registers, See UM registers user mask instructions, 92, 98–100 user NaT (UNAT) collection register, See UNAT registers
Index V vectorization optimizations, 287 virtual address space, 22 VLIW (very long instruction word), 2, 3, 12 processors, xxiii, xxvi von Neumann architecture, 7
Index W WAR (write-after-read dependencies), 57–58, See also ambiguous memory dependencies wavefront scheduling, 293 WAW (write-after-write dependencies), 57, See also ambiguous memory dependencies web addresses algorithms, 244, 248, 256 compilers, 283 words, 19–21, 22–23, 32–33
Index Z zero-extended numbers, 21 semaphore instructions, 89, 90 user mask instructions, 99 zero-extend instructions, 108–109
List of Figures
Chapter 1: Introduction
Figure 1.1: Pipelined Execution in a Load/Store Architecture
Chapter 2: Data, Code, and Memory
Figure 2.1: Integer Number Formats
Figure 2.2: Application Memory Address Space
Figure 2.3: Little-endian Data
Figure 2.4: Big-endian Data
Chapter 3: Register Resources
Figure 3.1: Application Register Set
Figure 3.2: General Register Set
Figure 3.3: Static and Stacked General Registers
Figure 3.4: Organization of the Predicate Registers
Figure 3.5: Branch Registers
Figure 3.6: Frame Marker Format
Figure 3.7: RSC Register Format
Figure 3.8: BSP Register Format
Figure 3.9: BSPSTORE Register Format
Figure 3.10: RNAT Register Format
Figure 3.11: Epilog Count Register Format
Figure 3.12: PFS Format
Figure 3.13: User Mask Format
Figure 3.14: CPUID Register File and Fields
Figure 3.15: Performance Registers
Chapter 4: Application Instruction Set
Figure 4.1: Identifying Instruction Groups
Figure 4.2: Instruction Code Bundle Format and Example
Figure 4.3: Bundle and Instruction Group Boundaries
Figure 4.4: Instruction Groups and Corresponding Bundles
Figure 4.5: Syllable of Code
Chapter 5: Memory Access and Register Transfer Instructions
Figure 5.1: Architectural View of Memory
Figure 5.2: Branch Barrier Removal with Control Speculation
Figure 5.3: Store Barrier Removal with Data Speculation
Figure 5.4: Data Speculation Routine Involving Speculation Check and Recovery
Figure 5.5: Relationship Between Predicate Registers and Bit Locations
Chapter 6: Integer Computation, Character String, and Population Count Instructions
Figure 6.1: Add Pointer Operation
Figure 6.2: Shift Left and Add Pointer Operation
Figure 6.3: 128-bit-input Funnel Shift Operation
Figure 6.4: Extract Operation
Figure 6.5: Deposit Operation
Chapter 7: Compare and Branch Instructions
Figure 7.1: Branch Elimination through Predication
Figure 7.2: Optimized and Nonoptimized Predication for IF-THEN-ELSE Program Structure
Figure 7.3: Optimizing Code with the Parallel Compare Operation
Figure 7.4: Organization of the Register Stack
Figure 7.5: Treatment of Functions
Figure 7.6: Register Stack Engine Operation
Figure 7.7: Structure of a Function
Figure 7.8: Conditional Execution of Function Calls
Figure 7.9: Nested Functions A and B
Figure 7.10: Stack Frame Transitions for Call and Return of Function B
Figure 7.11: Nonpipelined While Loop Program Structure
Figure 7.12: Search for a Matching Quad Word in a Block of Memory
Figure 7.13: Nonpipelined Counted Loop Program Structure
Figure 7.14: Clearing a Block of Memory
Figure 7.15: Nonpipelined versus Pipelined Loop Execution
Figure 7.16: Three Treatments of the Counted Loop
Chapter 8: Multimedia Instructions
Figure 8.1: Little-endian Ordered Multimedia Parallel Data in Memory and Registers
Figure 8.2: 4x16 Parallel Add Operation
Figure 8.3: 4x16 Parallel Average Operations with Rounding
Figure 8.5: 4x16 Parallel Average Subtraction Operation
Figure 8.4: 8x8 Parallel Subtraction Operation
Figure 8.6: 4x16 Parallel Comparison Operation
Figure 8.7: Example of an 8x8 Parallel Comparison
Figure 8.8: Right and Left Parallel Multiplication Operations
Figure 8.9: Parallel Multiplication and Shift Right Operation
Figure 8.10: Parallel Sum of Absolute Difference Operation
Figure 8.11: 4x16 Parallel Minimum Operation
Figure 8.12: 4x16 Parallel Maximum Operation
Figure 8.13: 4x16 Parallel Shift Left Operation
Figure 8.14: 8x8 mux Operations
Figure 8.15: 4x16 mux2 Operations
Figure 8.16: 4x16 mix Operations
Figure 8.17: Operation of the mix1.r Instruction
Figure 8.18: 4x16 Pack Operation
Figure 8.19: Operation of the pack4.sss Instruction
Figure 8.20: 4x16 Unpack Operations
Figure 8.21: Operation of the unpack4.l Instruction
Chapter 9: Floating-point Architecture
Figure 9.1: Floating-point Register File
Figure 9.2: Floating-point Status Register Format
Figure 9.3: Floating-point Status Field Format
Figure 9.4: User Mask Format
Figure 9.5: Format of a Floating-point Register
Figure 9.6: Saving and Loading a Single-precision Floating-point Number from Memory
Figure 9.7: Saving and Loading a Double-precision Floating-point Number from Memory
Figure 9.8: Saving and Loading a Double-extended Precision Floating-point Number from Memory
Figure 9.9: Spilling the Content of a Floating-point Register to Memory
Figure 9.10: Saving and Loading an Integer Number in Floating-point Memory Format into a Floating-point Register
Figure 9.11: Optimizing a Recurrence Floating-point Computation
Figure 9.12: Floating-point Merge Operations
Figure 9.13: Parallel Floating-point Merge Operations
Figure 9.14: Floating-point Parallel Mix Operations
Figure 9.15: Floating-point Pack Operation
Figure 9.16: Floating-point Swap Operations
Figure 9.17: Floating-point Sign Extend Operations
Chapter 10: IA-32 Compatibility and Application Execution
Figure 10.1: Compatibility with IA-32 Applications
Figure 10.2: Instruction Set Transition Model
Chapter 11: Compiler Technology for the Itanium-based Applications
Figure 11.1: Compilation Process
Figure 11.2: Dynamic Profile-guided Compilation
Figure 11.3: Code Generator Compilation Order
Figure 11.4: Impact of Predicated Execution on Branch Prediction
List of Tables
Chapter 2: Data, Code, and Memory
Table 2.1: Data Types and Sizes
Table 2.2: Range for Unsigned and Signed Integer Numbers
Chapter 3: Register Resources
Table 3.1: Frame Marker Field Descriptions
Table 3.2: Description of the Application Registers
Table 3.3: RSC Field Descriptions
Table 3.4: PFS Field Descriptions
Table 3.5: User Mask Field Descriptions
Table 3.6: CPUID Register 3 Field Descriptions
Chapter 4: Application Instruction Set
Table 4.1: Register File Notation for Assembly Language Statements
Table 4.2: Types of Data Dependencies
Table 4.3: Relationship between Instruction Type and Execution Unit Type
Table 4.4: Template Field Encoding and Instruction Slot Mapping
Chapter 5: Memory Access and Register Transfer Instructions
Table 5.1: Load Instruction Formats
Table 5.2: sz Completers
Table 5.3: Load Type Completers
Table 5.4: Supported Memory Load Instructions
Table 5.5: Load and Store Hint Completers
Table 5.6: Store Instruction Formats
Table 5.7: Store Type Completers
Table 5.8: Supported Memory Store Instructions
Table 5.9: Speculation Check Instruction Formats
Table 5.10: Semaphore Instruction Subgroup
Table 5.11: Semaphore Type Completers
Table 5.12: Register Transfer Instructions
Table 5.13: Indirect Register File Mnemonics
Table 5.14: Execution Unit Required to Access Application Registers
Chapter 6: Integer Computation, Character String, and Population Count Instructions
Table 6.1: Integer Arithmetic Instructions
Table 6.2: Logical Instructions
Table 6.3: Large Constant Generation Instructions
Table 6.4: 32-bit Address Pointer and 32-bit Integer Instructions
Table 6.5: Available Values for the xsz Part of Mnemonic
Table 6.6: Shift and Bit Field Instructions
Table 6.7: Character String and Population Count Instructions
Table 6.8: Result Ranges for czx
Chapter 7: Compare and Branch Instructions
Table 7.1: Compare Instruction Formats
Table 7.2: Compare Relationship Completers for Normal and Unconditional Instruction Types
Table 7.3: Comparison Type Completers
Table 7.4: Compare Relationship Completers for Parallel Types
Table 7.5: Compare Relationships Implemented as Pseudo-operations for Normal and Unconditional Types
Table 7.6: Test Bit and Test NaT Bit Relationships for Normal and Unconditional Type Completers
Table 7.7: Branch Instruction Formats
Table 7.8: Branch Types Completers
Table 7.9: Branch Whether Hint Completers
Table 7.10: Sequential Prefetch Hint Completers
Table 7.11: Branch Cache Deallocation Hint Completers
Table 7.12: Register Stack Instructions
Table 7.13: Phases of Execution
Table 7.14: Pipelined Execution of the Loop Program
Chapter 8: Multimedia Instructions
Table 8.1: Parallel Arithmetic Instruction Formats
Table 8.2: Saturation Completers
Table 8.3: Parallel Saturation Limits
Table 8.4: Parallel Average Completers
Table 8.5: Parallel Comparison Relationships
Table 8.6: Multiply Completers
Table 8.7: Parallel Multiplication and Shift Right Count Options
Table 8.8: Parallel Shift Instruction Formats
Table 8.9: Data Arrangement Instruction Formats
Table 8.10: mux Permutations for 8-Bit Elements
Chapter 9: Floating-point Architecture
Table 9.1: IEEE Real-data Type Ranges
Table 9.2: IEEE Real-data Type Memory Formats
Table 9.3: Trap Fields of the Floating-point Status Register
Table 9.4: Status Flags of the FPSR Status Field
Table 9.5: Control Fields of the FPSR Status Field
Table 9.6: Floating-point Rounding Control Definitions
Table 9.7: Floating-point Computational Model Control Definitions
Table 9.8: Floating-point Status Register Instruction Formats
Table 9.9: Status Field Completers
Table 9.10: User Mask Floating-point Field Description
Table 9.11: Floating-point Register Encoding
Table 9.12: Floating-point Memory Access, Register Transfer, and Data Conversion Instruction Formats
Table 9.13: Supported Floating-point Load and Store Instructions
Table 9.14: fsz Completers
Table 9.15: Floating-point Arithmetic Instructions that Employ One Source Operand
Table 9.16: Floating-point Arithmetic Instructions that Employ Two Source Operands
Table 9.17: Precision Control Completer
Table 9.18: Floating-point Arithmetic Instructions that Employ Three Source Operands
Table 9.19: Floating-point Arithmetic Instructions that Perform Approximations
Table 9.20: Integer Multiply Instruction Formats
Table 9.21: Floating-point Logic Instructions
Table 9.22: Floating-point Data Arrangement Instruction Formats
Table 9.23: Floating-point Compare Instructions
Table 9.24: Floating-point Compare Relationships
Table 9.25: Floating-point Comparison Types
Table 9.26: Floating-point Class Relationships
Table 9.27: Floating-point Classes (fclass9) Options
Chapter 10: IA-32 Compatibility and Application Execution
Table 10.1: Itanium Processor Operating Environments
Table 10.2: Invoke IA-32 Branch Instruction Format
Table 10.3: Jump to Itanium Instruction Set
Table 10.4: IA-32 Register Mapping
List of Code Examples
Chapter 3: Register Resources
Example
Chapter 5: Memory Access and Register Transfer Instructions
Example Example Example Example Example Example
Chapter 6: Integer Computation, Character String, and Population Count Instructions
Example Example Example Example Example Example Example Example Example Example
Chapter 7: Compare and Branch Instructions
Example Example Example Example Example Example
Chapter 8: Multimedia Instructions
Example Example Example Example Example Example Example Example Example Example Example
Chapter 9: Floating-point Architecture
Example Example Example Example Example Example Example Example Example Example Example Example Example Example Example