From: Richard Tomkins [tomkinsr@home.com]
Sent: Thursday, August 09, 2001 11:24 AM
To: Info-VAX@Mvb.Saic.Com
Subject: Re: Alpha-IA64 FAQ

http://www.compaq.com/hpc/ref/ref_alpha_ia64.doc

This document makes a strong case for the Alpha Processor.

rtt

Alpha and IA64

Executive Summary

Applications have two types of parallelism: instruction-level parallelism and
thread-level parallelism. Instruction-level parallelism enables a processor to
issue multiple instructions in the same cycle. It can be static (discovered by
the compiler at compile-time) or dynamic (discovered by the processor at
run-time). Thread-level parallelism enables a processor to run multiple
threads, processes, or programs at the same time.

An Alpha processor will exploit static and dynamic instruction-level
parallelism with out-of-order execution, and thread-level parallelism with
simultaneous multithreading. Out-of-order execution has a performance benefit
of 1.5-2x over in-order execution. Simultaneous multithreading has a benefit
of 1.5-3x over single-threaded execution.

An IA64 processor will only exploit static instruction-level parallelism. It
cannot take advantage of dynamic instruction-level parallelism or thread-level
parallelism. IA64 defines a set of architectural extensions to permit
compilers to identify more instruction-level parallelism. These architectural
extensions will make it very difficult for an IA64 processor to implement
out-of-order execution or simultaneous multithreading efficiently. For most
applications, the small benefit that these architectural extensions give
compilers does not offset the performance lost by not using these dynamic
techniques.

Alpha will be superior to IA64 on commercial applications. Commercial
applications are very sensitive to code size. The IA64 instruction encoding
increases the code size of a program by at least 33%, and the compiler
techniques required by IA64 introduce many additional instructions. Commercial
programs are difficult to analyze at compile-time, and IA64 cannot dynamically
adjust to program behavior at run-time. Commercial programs have very low
instruction-level parallelism, but they are typically explicitly
multithreaded. Each thread is very sequential and includes long delays waiting
for memory. The IA64 strategy of searching for instruction-level parallelism
cannot find the orders-of-magnitude improvements available to Alpha through
simultaneous multithreading.

Alpha will be superior to IA64 in high performance technical computing. Memory
bandwidth and the scalability of the system limit the performance of most high
performance technical applications. Future Alpha processors are adding a
low-latency, high-bandwidth memory interface on chip, together with on-chip
support for distributed shared memory. The next generation Alpha processors
will have the fastest memory system in the industry. Alpha will be the leader
in high performance technical computing.

1. Introduction

Future Alpha processors will be developed around two architectural concepts:
out-of-order execution and simultaneous multithreading.

- Out-of-order execution enables the processor to schedule the execution of
  instructions in an order that maximizes program performance. It has a proven
  benefit of 1.5-2x over in-order execution.

- Simultaneous multithreading (SMT) enables multiple threads (or processes) to
  run simultaneously on a single microprocessor.
  Most server applications are divided into multiple threads, and SMT permits
  these applications to take full advantage of the multiple execution units on
  the processor. SMT has a benefit of 1.5-3x over single-threaded execution.

These two features permit an Alpha processor to exploit both thread-level
parallelism and instruction-level parallelism. The processor can use these two
types of parallelism interchangeably, and dynamically adapt to the varying
requirements of the application.

Intel has chosen a markedly different direction from Alpha. Intel is
introducing a new 64-bit instruction set architecture called IA64. Intel has
called the architecture EPIC, for Explicitly Parallel Instruction Computing,
but it is essentially a VLIW (Very Long Instruction Word) architecture. The
IA64 architecture is very similar to the machines of Cydrome, a failed
minisupercomputer company of the 1980s. The first implementation of IA64 is
called Merced, with a follow-on implementation called McKinley.

With IA64, Intel is focusing on compiler-driven technology to increase
instruction-level parallelism, and is ignoring other proven ways to improve
performance on large applications. IA64 is developed around an in-order
execution model, with a set of new architectural extensions to permit
compilers to identify more instruction-level parallelism. These architectural
extensions will make it very difficult for IA64 processors to implement
out-of-order execution or simultaneous multithreading efficiently. For most
applications, the small benefit that these architectural extensions give
compilers does not offset the performance lost by not using these dynamic
techniques.

2. Design Philosophy

IA64: a smart compiler and a dumb machine

The IA64 design is a derivative of the VLIW machines designed by Multiflow and
Cydrome in the 1980s. The key idea is a generalization of horizontal
microcode: through a wide instruction word, the processor presents control of
all of its functional units to the compiler, and the compiler precisely
schedules where every operation, every register file read, every bypass will
occur. In effect, the compiler creates a record of execution for the program,
and the machine plays that record. In the early VLIWs, if the compiler made a
mistake, the machine generated the wrong results; the machine had no logic to
check that registers were read in the correct order or that resources were not
oversubscribed. In more modern machines such as the IA64 processors, the
machine will run slowly (but correctly) when the compiler is wrong.

The IA64 design requires the compiler to predict at compile-time how a program
will behave. Traditionally, VLIW-style machines have been built without caches
and focused on loop-intensive, vectorizable code. These restrictions mean the
memory latency is fixed and branch behavior is very predictable at
compile-time. However, IA64 will be implemented as a general-purpose
processor, with a data cache, running a wide variety of applications. In most
applications, the latency of a memory operation is very difficult to predict;
a cache miss may have a latency 100 times longer than a cache hit. Alpha's
out-of-order design can dynamically adjust to the cache behavior of the
program; on an IA64 processor, when the compiler makes a mistake, the machine
will stall.

Similarly, the IA64 design requires the compiler to move code across branches
to find parallelism. However, this decision requires the compiler to predict
branch direction at compile-time.
This is very difficult to do, and even with elaborate profile-feedback
systems, where a program is run to gather information about its behavior
before it is compiled, compile-time branch prediction rates are at best 85%.
Without feedback, the compile-time rates are much closer to 50%. In contrast,
hardware branch predictors are 95-98% accurate. An IA64 design will therefore
be executing unprofitable speculative instructions 3-10x more frequently than
an Alpha design (a 15% compile-time mispredict rate against a 5% run-time rate
is 3x; a 50% rate is 10x).

The IA64 is an architectural idea that was developed for vectorizable
programs. Intel has tried to extend it to commercial applications, but it is
fundamentally the wrong design for these problems.

Alpha: a smart compiler and a smart machine

An explicit goal in the development of the Alpha architecture was to enable
innovative performance improvements in compilers, architecture, and circuit
implementation. We did not add features to the instruction set architecture
that make compiler improvements easy but hardware improvements difficult. In
the early 1990s, we designed a VLIW version of Alpha similar to IA64
[1,2,3,4,5,6]. During this process we discovered that most of the compiler
technology for a VLIW processor could equally well be applied to a RISC
processor, and that by avoiding IA64-style extensions to Alpha, we could also
implement an out-of-order processor.

Alpha is designed to exploit both compile-time and run-time information. We
agree with the IA64 designers that the compiler should create a record of
execution for a program. However, we also recognized that at run-time the
processor will know additional information about a program's behavior, for
example, whether a memory reference is a cache miss and which direction a
branch takes. Rather than stall the processor when the compiler is wrong, we
designed an out-of-order issue mechanism that allows the machine to adapt to
the run-time behavior of the program. In addition, a compiler has a restricted
view of the program and often cannot optimize across routine or module
boundaries. At run-time, an out-of-order processor can find parallelism across
these boundaries. Compiler technology must be combined with out-of-order
execution to extract the most instruction-level parallelism from a program.

Simultaneous multithreading permits an Alpha processor to exploit thread-level
parallelism in addition to instruction-level parallelism. Most commercial
applications have very small amounts of instruction-level parallelism, but
they are frequently composed of multiple parallel threads. SMT enables an
Alpha processor to achieve large speedups on these applications. SMT also
permits the processor to exploit instruction-level parallelism fully when it
is available.

Alpha is designed for a wide range of commercial applications. Its
industry-leading memory bandwidth and high floating-point performance will
enable it to excel on scientific programs as well. Simultaneous multithreading
is a natural extension of Alpha's out-of-order implementation. It is the most
powerful mechanism for exploiting the explicit parallelism in most application
workloads.

3. Alpha features

Out-of-order execution

Out-of-order execution is a combination of three techniques:

- Dynamic scheduling. The processor can reorder instructions to reduce
  processor stalls.

- Register renaming. The processor can rename registers to remove
  write-after-read and write-after-write hazards.

- Branch prediction. The processor can predict the direction of a branch
  before the branch is executed.
The basic organization of a processor is the fetch-execute cycle. The
processor fetches an instruction from the instruction cache, executes the
instruction, updates the register file and memory, and retires the
instruction. Pipelined systems overlap these stages for successive
instructions. Out-of-order systems fetch multiple instructions into a queue
(or instruction window). The processor can execute the instructions in the
window "out-of-order", but it must retire the instructions and make their
results visible in the register file and memory system in program order. These
concepts are best understood by looking at some simple examples.

Dynamic scheduling. Figure 1 presents a straight-line block of 4 instructions,
written in pseudo-code. Consider executing this block on a 4-issue machine
with a 3-cycle load latency on a cache hit. On an in-order pipeline, the
sequence will take 7 cycles, assuming cache hits. On an out-of-order pipeline,
the processor will fetch all 4 instructions at once and notice that the second
LOAD is independent of the first ADD. It will move the second LOAD up to dual
issue with the first, and the sequence will take 4 cycles.

                     in-order              out-of-order
   t1 = LOAD a0      0: t1 = LOAD a0       0: t1 = LOAD a0
   t2 = ADD t1,1     1:                       t3 = LOAD a1
   t3 = LOAD a1      2:                    1:
   t4 = ADD t3,1     3: t2 = ADD t1,1      2:
                        t3 = LOAD a1       3: t2 = ADD t1,1
                     4:                       t4 = ADD t3,1
                     5:
                     6: t4 = ADD t3,1

Figure 1: Dynamic scheduling.

Of course, within a straight-line block of code, the compiler could (and
should) schedule the code, moving the second LOAD above the ADD and giving the
in-order machine the same performance as the out-of-order. There are two
situations where the compiler and the in-order processor cannot equal the
out-of-order processor.

The first is dealing with operations of variable latency. In the example
above, assume each of the LOADs occasionally misses the cache and goes to
memory, resulting in a latency of 100 cycles or more. The out-of-order machine
will delay the ADD that is dependent on the load, but it will continue
fetching and executing instructions that are not dependent on the LOAD. The
in-order machine will stall at the ADD instruction, and will not execute any
further instructions until the LOAD completes, resulting in a 100-cycle stall
for the processor.

A second situation is dealing with memory aliasing. If there were a STORE
between the first ADD and the second LOAD, and the compiler could not prove
that the LOAD always refers to a location distinct from the STORE, then the
compiler could not move the LOAD above the STORE. In an out-of-order machine,
the processor will know the addresses of the LOAD and the STORE, and can
determine whether it is safe for them to issue out-of-order. IA64 has
introduced architectural support for speculative execution to address the
problems of variable latency and memory aliasing in an in-order processor.
However, this support is not as powerful a solution as dynamic scheduling.
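To see the aliasing problem at the source level, consider a small C fragment
(our illustration; the function and its names are hypothetical). Unless the
compiler can prove that p never points at q[1], it cannot hoist the second
load above the store; an out-of-order processor compares the actual addresses
at run-time and reorders the accesses whenever they do not conflict.

   /* The store through p may or may not overwrite q[1]; at compile-time
    * the worst case must be assumed, so the load of q[1] stays below
    * the store. */
   int sum_fields(int *p, int *q)
   {
       int t2 = q[0] + 1;
       *p = t2;            /* possible alias with q[1] */
       int t4 = q[1] + 1;  /* cannot be hoisted above the store */
       return t2 + t4;
   }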
Register renaming. Figure 2 presents a similar straight-line block of 4
instructions. Note that the second LOAD reuses the same register as the first.
A compile-time scheduler for an in-order processor cannot move the second LOAD
above the preceding ADD, because the second LOAD would overwrite the input of
the ADD. This is called a write-after-read hazard. In the out-of-order
processor, the architectural registers are mapped into a larger physical
register file, and the result of each instruction is written to a distinct
physical register. This removes all write-after-read and write-after-write
hazards, and permits the processor to move the second LOAD above the ADD.

                     in-order              out-of-order
   t1 = LOAD a0      0: t1 = LOAD a0       0: p1 = LOAD a0
   t2 = ADD t1,1     1:                       p3 = LOAD a1
   t1 = LOAD a1      2:                    1:
   t1 = ADD t1,1     3: t2 = ADD t1,1      2:
                        t1 = LOAD a1       3: p2 = ADD p1,1
                     4:                       p4 = ADD p3,1
                     5:
                     6: t1 = ADD t1,1

Figure 2: Register renaming.

Of course, within a straight-line block of code, the compiler could (and
should) rename the architectural registers to permit the scheduler to reorder
the instructions. This can typically be done unless the compiler has run out
of architectural registers to allocate, or there are required bindings of the
registers. For example, at a procedure call, the calling standard requires the
use of some specific registers. IA64 has introduced a large number of
registers to permit the compiler to do aggressive renaming. However, this
register real estate can be used more effectively to hold a large physical
register file for an out-of-order processor.

Branch prediction. Figure 3 presents a two-block sequence of code, where we
load and add a value, and then branch to L1. Assume the branch is predicted
correctly. On an in-order machine, this sequence takes 9 cycles. Even though
the branch is predicted correctly, the in-order processor cannot issue the
instructions after the branch until the branch is issued. On an out-of-order
machine, the correct branch prediction permits the processor to examine all 5
instructions at once, and dynamically schedule the second LOAD and ADD before
the branch.

                     in-order              out-of-order
   t1 = LOAD a0      0: t1 = LOAD a0       0: p1 = LOAD a0
   t1 = ADD t1,1     1:                       p3 = LOAD a1
   BEQ t1, L1        2:                    1:
                     3: t1 = ADD t1,1      2:
   L1:               4: BEQ t1, L1         3: p2 = ADD p1,1
   t1 = LOAD a1      5: t1 = LOAD a1          p4 = ADD p3,1
   t1 = ADD t1,1     6:                    4: BEQ p2, L1
                     7:
                     8: t1 = ADD t1,1

Figure 3: Branch prediction.

A compiler technique called trace scheduling permits the compiler to schedule
all of the instructions along an execution trace as a single basic block. This
would enable the compiler to schedule the second LOAD and ADD in parallel with
the first. However, to perform this code motion profitably, the compiler needs
to be able to predict the branch correctly at compile-time. Without any
additional knowledge of the program behavior, the compiler will be wrong 50%
of the time. Even with run-time feedback, the compiler will at best be correct
85% of the time. Good hardware branch predictors, which are used by an
out-of-order machine, are correct 95-98% of the time. The compiler's
prediction will be wrong 3-10 times as often as the out-of-order processor's,
and the performance of the in-order machine will not be able to equal the
out-of-order processor.

If a compiler moves a load above a conditional branch, it must be careful not
to introduce an unwarranted exception, for the load will now be executed when
the program did not intend it to be. IA64 has introduced some architectural
features to address this problem. However, using these features will also
increase the code size of the program.
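The exception problem is easy to state in C (again, an illustrative fragment
of ours, not from the paper):

   /* In the source, *b is dereferenced only when a is true. If the
    * compiler hoists the load above the test to hide its latency, the
    * program can fault on a null b even though the original program
    * never touched it. */
   int f(int a, int *b)
   {
       int x = 0;
       if (a)
           x = *b + 1;
       return x;
   }

The speculative-load support discussed below (Figure 15) exists precisely to
let the compiler perform this hoist without changing the program's exception
behavior.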
Predicting through a function call. Figure 4 presents a 3-block sequence of
code, where the first two blocks lead up to a function call. Assume the branch
is predicted correctly. On the in-order machine, this requires 11 cycles. On
the out-of-order machine, the correct branch prediction will permit the
processor to see the instructions on both sides of the function call, and the
processor can complete execution of the body of the function before the BSR
instruction is executed!

                     in-order              out-of-order
   t1 = LOAD a0      0: t1 = LOAD a0       0: p1 = LOAD a0
   t1 = ADD t1,1     1:                       p3 = LOAD a1
   BEQ t1, L1        2:                    1:
                     3: t1 = ADD t1,1      2:
   L1:               4: BEQ t1, L1         3: p2 = ADD p1,1
   BSR foo           5: BSR foo               p4 = ADD p3,1
                     6: t1 = LOAD a1       4: BEQ p2, L1
   foo:              7:                    5: BSR foo
   t1 = LOAD a1      8:                    6: ret
   t1 = ADD t1,1     9: t1 = ADD t1,1
   ret               10: ret

Figure 4: Predicting through a function call.

Of course, in a simple example like this, the compiler could simply in-line
the call itself, giving the in-order machine the same performance. However,
inlining is often not a practical optimization, because the called routine is
too large or the routine body is not available to the compiler.

In summary, the Alpha out-of-order design has several advantages over the IA64
in-order design. Alpha can adapt to memory references that occasionally miss
in the cache, avoiding delays of 100 cycles or more. Alpha can find
parallelism when the architectural registers do not express it. And Alpha can
find parallelism across branches and across function calls dynamically, at
run-time.

IA64 depends on compile-time predictions to obtain static instruction-level
parallelism. IA64 relies on the compiler to predict which loads will miss the
cache, how memory operations alias with each other, and the direction of each
branch. If the compile-time predictions are correct, both IA64 and Alpha will
perform well. But when the compiler is wrong, the out-of-order Alpha processor
can adapt and continue to perform well. The in-order IA64 will run slowly.

Simultaneous Multithreading

Computer systems exploit two forms of parallelism: thread-level parallelism
(TLP) and instruction-level parallelism (ILP). Thread-level parallelism
enables a multiprocessor system to run multiple threads from an application,
or multiple independent programs, at the same time. Instruction-level
parallelism enables a superscalar processor to issue multiple instructions in
the same cycle. Simultaneous multithreading (SMT) is a new technology that
permits a processor to exploit both TLP and ILP. Multiple threads can run on
an SMT processor, and the processor will dynamically allocate resources
between threads, enabling the processor to adapt to the varying requirements
of the workload. Most server applications are divided into multiple threads,
and SMT permits these applications to take full advantage of the multiple
execution units on the processor.

The unique advantage of an SMT processor is that it can use thread-level
parallelism and instruction-level parallelism interchangeably. Multiple
threads can run in parallel in the parallel portion of an application; in the
sequential portions, all of the processor resources can be applied to a single
thread. Amdahl's law says that the performance of a parallel application is
limited by the amount of time spent in the sequential portion; improving the
performance of a parallel application requires speedups in both the parallel
and sequential portions. An SMT processor can effectively deliver this
speedup.
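In its standard form (the formula itself is not stated in this paper),
Amdahl's law can be written as

   Speedup(N) = 1 / ((1 - P) + P/N)

where P is the fraction of the work that is parallelizable and N is the number
of threads. With P = 90%, even an unlimited number of threads yields at most a
10x speedup; further gains must come from running the sequential (1 - P)
portion faster, which SMT provides by devoting all of the processor's
resources to a single thread when only one is runnable.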
The opportunity. Alpha's out-of-order execution and IA64's explicitly parallel
instruction computing both find parallelism at the instruction level. They are
both techniques for issuing multiple instructions per cycle from a single
thread. The amount of potential instruction-level parallelism depends on the
program and varies greatly from application to application.

Figure 5 presents a simple example with a large amount of instruction-level
parallelism. Assume we need to do 4 instructions of work for each element in
array a, that the array is in the data cache, and that a load takes 3 cycles
to return its value on a cache hit. On a dual-issue machine, we can load a
future element of the array while we work on the current one and utilize the
functional units effectively. On a quad-issue machine, we can load two future
elements while working on two current elements, and continue to use the
machine very effectively.

   for (i = 0; i < n; i++)
       work on a[i]

   dual issue
      ALU0           ALU1
   0: LOAD a[i+1]    .
   1: work a[i]      work a[i]
   2: work a[i]      work a[i]
   3: LOAD a[i+2]    .
   4: work a[i+1]    work a[i+1]
   5: work a[i+1]    work a[i+1]

   quad issue
      ALU0           ALU1           ALU2           ALU3
   0: LOAD a[i+2]    .              LOAD a[i+3]    .
   1: work a[i]      work a[i]      work a[i+1]    work a[i+1]
   2: work a[i]      work a[i]      work a[i+1]    work a[i+1]
   3: LOAD a[i+4]    .              LOAD a[i+5]    .
   4: work a[i+2]    work a[i+2]    work a[i+3]    work a[i+3]
   5: work a[i+2]    work a[i+2]    work a[i+3]    work a[i+3]

Figure 5: An array problem.

Figure 6 presents a similar example with limited instruction-level
parallelism. We have changed the data structure from an array to a linked
list. On a dual-issue machine, we can load the next element of the list while
we work on the current one and utilize the functional units effectively.
However, on a quad-issue machine, we can find no additional parallelism. We
can only process one list element every three cycles, even if we can
parallelize the work on an element. We cannot load future elements of the list
until we have loaded the current one. This limitation is fundamental to the
example, and we cannot find more instruction-level parallelism without
rewriting the program. However, with simultaneous multithreading, we can use
the unutilized functional units to run a second program, or another thread
from the same program.

   while (a != 0)
       work on node a
       a = a->next

   dual issue
      ALU0                ALU1
   0: an = LOAD a->next   .
   1: work a              work a
   2: work a              work a
   3: a = LOAD an->next   .
   4: work an             work an
   5: work an             work an

   quad issue
      ALU0                ALU1      ALU2      ALU3
   0: an = LOAD a->next   .         .         .
   1: work a              work a    work a    work a
   2: .                   .         .         .
   3: a = LOAD an->next   .         .         .
   4: work an             work an   work an   work an
   5: .                   .         .         .

Figure 6: A list problem.

Scientific applications typically use arrays as data structures, and they have
a large amount of instruction-level parallelism. Out-of-order issue and
explicitly parallel instruction computing will both perform well. Commercial
applications typically use lists as data structures, and they have much less
instruction-level parallelism; often they average less than one instruction
per cycle. Only a technique that exploits thread-level parallelism, such as
simultaneous multithreading, can fully utilize the functional units of the
processor.

Simultaneous multithreading. Simultaneous multithreading works by turning
thread-level parallelism into instruction-level parallelism. An Alpha
out-of-order processor has an in-order instruction fetch mechanism and an
out-of-order execution mechanism. Instructions are fetched in the order that
they appear in the program, and they are executed in an order determined by
the data dependencies between the instructions. Between the fetch and execute
portions of the machine is an issue queue. Instructions wait in the queue
until they are ready to issue. See Figure 7.
   [In-order fetch] -> [Queue] -> [Out-of-order execution] -> [In-order retire]

Figure 7: Alpha out-of-order pipeline.

SMT is implemented by enabling the fetch unit to fetch from a different thread
every cycle. The fetched instructions have their registers renamed into a
large physical register file, and then they are placed in the queue. Because
of the register renaming, instructions from multiple threads can be mixed in
the queue. The logic that determines whether an instruction is ready to issue
examines only physical registers, and it can select instructions from
different threads in the same cycle. In fact, it is very likely that
instructions from different threads will issue in the same cycle, for they
will reference different physical registers. There is some minor bookkeeping
required to keep track of the threads, but SMT can be implemented without any
major changes to the processor pipeline, without an increase in the cycle time
of the processor, and without a significant increase in the size of the chip.

The processor dynamically adapts to the requirements of the threads with the
instruction fetch policy. On a given cycle, we will fetch instructions from
the thread that has the fewest instructions in flight (i.e., instructions that
have been fetched but not retired). This prevents a single thread from filling
the instruction queue, and makes sure multiple threads are executing at the
same time, increasing the thread-level parallelism. It also enables the thread
with the most instruction-level parallelism at the moment to use the machine
effectively.
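The fetch heuristic is simple enough to express in a few lines of C. The
sketch below is ours, not the hardware design; the structure and function
names are illustrative.

   #include <limits.h>

   #define NTHREADS 4

   struct thread_state {
       int runnable;    /* thread has instructions to fetch         */
       int in_flight;   /* instructions fetched but not yet retired */
   };

   /* Pick the thread to fetch from this cycle: the runnable thread
    * with the fewest instructions in flight. Returns -1 if none. */
   int select_fetch_thread(const struct thread_state t[NTHREADS])
   {
       int best = -1;
       int best_count = INT_MAX;
       for (int i = 0; i < NTHREADS; i++) {
           if (t[i].runnable && t[i].in_flight < best_count) {
               best = i;
               best_count = t[i].in_flight;
           }
       }
       return best;
   }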
   for each thread do
       a = todo[tid]
       while (a != 0)
           work on node a
           a = a->next

   Multiprocessor

   processor 1
      ALU0  ALU1  ALU2  ALU3  ALU4  ALU5  ALU6  ALU7
   0: LDan  .     .     .     .     .     .     .
   1: Wa    Wa    Wa    Wa    .     .     .     .
   2: .     .     .     .     .     .     .     .
   3: LDa   .     .     .     .     .     .     .
   4: Wan   Wan   Wan   Wan   .     .     .     .
   5: .     .     .     .     .     .     .     .

   processor 2
      ALU0  ALU1  ALU2  ALU3  ALU4  ALU5  ALU6  ALU7
   0: LDan  .     .     .     .     .     .     .
   1: Wa    Wa    Wa    Wa    .     .     .     .
   2: .     .     .     .     .     .     .     .
   3: LDa   .     .     .     .     .     .     .
   4: Wan   Wan   Wan   Wan   .     .     .     .
   5: .     .     .     .     .     .     .     .

   Simultaneous multithreaded processor
      ALU0  ALU1  ALU2  ALU3  ALU4  ALU5  ALU6  ALU7
   0: LDan0 LDan1 LDan2 LDan3 .     .     .     .
   1: Wa0   Wa0   Wa0   Wa0   Wa1   Wa1   Wa1   Wa1
   2: Wa2   Wa2   Wa2   Wa2   Wa3   Wa3   Wa3   Wa3
   3: LDa0  LDa1  LDa2  LDa3  .     .     .     .
   4: Wan0  Wan0  Wan0  Wan0  Wan1  Wan1  Wan1  Wan1
   5: Wan2  Wan2  Wan2  Wan2  Wan3  Wan3  Wan3  Wan3

Figure 8: A multithreaded list problem.

Alternatives. SMT can deliver more performance than a conventional
multiprocessor. Consider the multithreaded version of the list program
presented in Figure 8. On two 8-issue processors arranged as a multiprocessor
system, we will be able to process two list elements every 3 cycles. On a
single 8-issue SMT processor, with the capability of running 4 threads, we are
able to process 4 list elements every 3 cycles. With half the number of
functional units, we can double the performance. This is because the SMT
machine can run more threads than the multiprocessor, and at the same time
offer each thread sufficient instruction-level parallelism.

4. IA64 features

IA64 basics

The IA64 architecture introduces 64-bit addressing and a new instruction set.
It also contains an IA32 mode; all IA64 processors will be able to execute
IA32 programs, though with disappointing performance. IA64 is a load/store
architecture; memory can only be referenced by explicit load and store
instructions. All other instructions operate on registers. There are five
register files: 128 64-bit integer registers, 128 82-bit floating-point
registers, 64 1-bit predicate registers, 8 branch registers, and 128
special-purpose application registers.

Each instruction typically has 2 register inputs, 1 predicate input, and 1
output. The instructions are gathered into 3-instruction bundles; the bundles
are a mechanism for expressing parallelism between instructions. There is
explicit architectural support for speculative execution, predication,
function calls, and software pipelining. We will consider these features in
detail below.

Instruction format

IA64 instructions are big. 3 instructions are packed into a 128-bit bundle;
each instruction is 41 bits, with a 5-bit template field to describe the
dependencies in the bundle. See Figure 9.

   127            87 86             46 45              5 4        0
   [ instruction slot 2 | instruction slot 1 | instruction slot 0 | template ]
          41b                  41b                  41b               5b

Figure 9: IA64 instruction format.

The larger instruction encoding is due to:

- Larger register files. A 128-entry register file requires a 7-bit identifier
  to select a register. Two input specifiers and one output specifier require
  21 bits in the instruction.

- Predication. Each instruction requires a 6-bit predicate specifier, to
  select a predicate from a 64-entry predicate register file.

- Explicit parallelism. 5 bits in each bundle are devoted to describing the
  dependencies within a bundle, or between bundles. The architecture permits
  the compiler to group multiple bundles together and indicate that all of the
  operations are data independent. The template bits indicate where a block of
  independent operations ends.

Comparing the instruction encodings with Alpha, we see that IA64 instructions
are 33% larger, based on their encoding alone. An Alpha instruction is 32
bits, and three instructions can be encoded in 96 bits; the corresponding IA64
bundle is 128 bits (128/96 = 1.33). Also, IA64 instructions can only be
expressed in bundles of 3, and not all combinations of instructions are
allowed; some padding of bundles with no-ops is required. In addition, all the
compiler techniques required for IA64 will increase code size. For important
server applications such as on-line transaction processing, code size has a
first-order effect on performance, and IA64 will be at a competitive
disadvantage.

The instruction encoding permits VLIW-style parallelism. For example, Figure
10 shows the code for a simple loop that adds a constant to a vector, with the
loop unrolled by 2. Each line in the figure is a bundle of 3 instructions. The
double semicolon indicates the end of a group of independent instructions.
Note that almost half of the instructions are NOPs due to restrictions on the
placement of instructions in bundles. For example, only one floating-point
instruction can occur in a bundle.

   L1: LDFD f4 = [r5],8    LDFD f14 = [r15],8    NOP ;;
       NOP                 FADD f7 = f4, f9      NOP
       NOP                 FADD f17 = f14, f9    NOP ;;
       STFD [r6] = f7,8    STFD [r16] = f17,8    BR.cloop L1

Figure 10: A simple vector loop.

Unlike traditional VLIWs, IA64 is fully scoreboarded. Bundles can be chained
together to indicate a sequence of data-independent operations, but the
machine must check for dependencies between these sequences. This removes much
of the simplicity of the traditional VLIW. In addition, IA64 must deal with
the variable latency of loads.

The ability to chain together bundles also permits the compiler to present an
arbitrary amount of parallelism to the hardware. This appears to remove one of
the main weaknesses of a VLIW: that the code must be compiled for a specific
width of machine. However, it is clear that for best performance, the compiler
must target a specific machine. For example, consider the simple loop in
Figure 10.
On a processor with two memory units, a software-pipelined schedule can be
written that maximizes the use of these units. But on a processor with 3
memory units, the loop would need to be further unrolled; the current loop
requires 4 memory references, and a software-pipelined schedule could only
utilize the memory units 67% of the time. The statically scheduled IA64 model
cannot dynamically adjust to utilize the machine resources efficiently.

Large register files

IA64 has large register files: 128 integer registers and 128 floating-point
registers. Registers have two uses: to store data that is used multiple times
for fast retrieval, and to express the parallelism in the program.

Large register files have a proven benefit in scientific programs; for
example, in blocked matrix solvers they can be used to hold a sub-array that
is repeatedly referenced. However, for general integer programs and commercial
servers, where data is less regularly structured and typically accessed
indirectly through pointers, it is more difficult to use a large register file
effectively to hold commonly used data; in these applications the extra code
space required to specify the registers offsets the minor gains from using the
register file to store data. A processor with simultaneous multithreading
presents the compiler with a full architectural register file for each thread.
It permits the programmer to use a large number of registers to hold data
without increasing the instruction size.

Registers are also required to express parallelism. Every operation that
produces a value requires a register specifier; in effect, it is the name of
the value. For example, to issue 8 operations per cycle requires 8 destination
registers. If, on average, each result is read 10 cycles later, 80 registers
are required to express the computation. On an in-order machine such as IA64,
architectural registers are required to express this parallelism. This is one
of the reasons IA64 has introduced larger register files and paid the price of
large instructions. On an out-of-order machine, the architectural registers
are mapped into a much larger physical register file. Physical registers can
be used to express the parallelism discovered by the dynamic scheduler; the
architectural register file and the instruction encoding do not need to grow.

Predication

Almost all IA64 instructions require a predicate register as an additional
input. The instruction is executed only if the predicate is true. To support
predication, IA64 includes a powerful compare instruction to produce
predicates; see Figure 11. The compare instruction compares r2 and r3, using a
comparison given by crel (e.g., greater than). In the typical case, the result
of the comparison is written to p1 and its complement to p2. This gives two
predicates to control the two sides of an if-then-else statement. A number of
other results are possible; this is controlled by the ctype qualifier.

   (qp) CMP.crel.ctype p1,p2 = r2,r3

Figure 11: IA64 compare instruction.

Predication permits the compiler to remove a branch. It is a generalization of
the Alpha conditional move instruction (CMOV); in IA64 terminology, CMOV would
be called a predicated move. Figure 12 shows three ways to compute
if (a) x = t+1.

   Alpha            IA64                    Alpha
   BNE a,.+2        CMP.EQ p1,p2 = a,0      ADD t,1,x0
   ADD t,1,x        (p1) ADD x = 1,t        CMOV a,x0,x
   (1)              (2)                     (3)

Figure 12: Three ways to compute if (a) x = t+1.

In (1) we use a branch; in (2), the IA64 predication; and in (3), the Alpha
CMOV instruction.
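The same contrast can be written in C (our illustration): the branching form
has a control dependence that branch prediction can remove, while the
predicated and CMOV forms compute unconditionally and then select, creating a
data dependence on the condition.

   /* (1): the add is control-dependent on the test; a correctly
    * predicted branch lets an out-of-order machine issue it early. */
   int x_branch(int a, int t, int x)
   {
       if (a)
           x = t + 1;
       return x;
   }

   /* (2)/(3): the result is data-dependent on a, as with predication
    * or CMOV; it cannot be produced until a is known. */
   int x_select(int a, int t, int x)
   {
       return a ? t + 1 : x;
   }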
For many common cases such as this one, the more general IA64 predication does
not improve the code.

Predication turns a control dependence into a data dependence. In (1) above,
the ADD instruction is not dependent on the branch. In an out-of-order
processor, when the branch is predicted correctly, the ADD may issue before
the branch. In the predicated IA64 version, the ADD must wait for the
comparison. In many programs, particularly large commercial applications, the
computation of the branch conditions is the critical path of the computation.
Run-time branch prediction is extraordinarily effective; 95-98% of branches
are predicted correctly. On an out-of-order machine, instructions can be
issued before the branch that controls them resolves. In an in-order machine,
predicated instructions must wait until their predicate is evaluated. This
increases the critical path length through the program.

Predication increases the instruction cache footprint of programs as well. It
requires the predicated instructions from both sides of an if-then-else to be
present in the instruction cache in order to execute either the then clause or
the else clause. For tight inner loops, predication may be a reasonable
decision, but for large loop-free programs such as databases or operating
systems, requiring both sides of an if-then-else clause to be cache resident
to execute only one of them will increase the instruction cache footprint and
lower performance. Also, every IA64 instruction is larger in order to contain
the predicate specifier. All code pays for a larger instruction cache
footprint to support the feature, even large loop-free server applications
where it cannot be profitably applied.

Predication also prevents the full utilization of the functional units in the
processor. If both sides of an if-then-else are scheduled in the same
straight-line code, only half of the functional units are effectively used;
the results of the functional units with a false predicate are discarded. On a
single-threaded processor such as IA64, this may seem like a reasonable
decision. However, on a simultaneous multithreaded processor, these functional
units are a valuable resource that can be used by another thread. Predication
must be used judiciously, which Alpha can do with CMOV.

Predication cannot remove branches when they are function calls. Figure 13
shows an excerpt from the 022.li benchmark in SPECint95. The static
probability of each branch is reported; two lines carry two probabilities,
because the C && operator is an implied branch. This decision pattern is found
in many commercial applications. All of the branches are highly biased, except
for one, which decides whether to call evform or xlgetvalue. However,
predication cannot be effectively applied to the 32% branch, because a
transfer of control to one of the called functions is required. Note that this
routine is an excellent candidate for an out-of-order processor, because all
of the branches (including the 32% branch) can be effectively predicted by a
run-time predictor, and the processor can overlap the execution of either
evform or xlgetvalue with the evaluation of the branch conditions in xleval.
   NODE *xleval(NODE *expr)
   {
       if (--xlsample <= 0)                    /* 1% */
           ...
       if (++xltrace < TDEPTH)                 /* 100% */
           trace_stack[xltrace] = expr;
       if (expr && expr->n_type == LIST)       /* 99.9%, 32.0% */
           expr = evform(expr);
       else if (expr && expr->n_type == SYM)   /* 99.9%, 99.9% */
           expr = xlgetvalue(expr);
       --xltrace;
       return (expr);
   }

Figure 13: The xleval routine from 022.li in SPECint95, annotated with static
branch probabilities.

Extensive predication makes it impractical to implement an out-of-order
version of IA64. In the example in Figure 14, the ADD in line [d] reads either
the value of x created by the MOV instruction in line [b] or the value created
by the predicated ADD instruction in line [c].

        CMP.EQ p1,p2 = a,0    [a]
        MOV x = 1             [b]
   (p1) ADD x = 1,t           [c]
        ADD y = x,1           [d]

Figure 14: Predication makes it impractical to implement an out-of-order IA64.

In an out-of-order processor, each instruction that writes a result is
assigned a new physical register. In Figure 14, assume the MOV in line [b]
writes physical register xv1 and the ADD in line [c] writes physical register
xv2. The ADD in line [d] will read physical register xv2, independent of the
value of the predicate p1. For the predicated instruction to work correctly in
an out-of-order processor, the instruction must either write a new value to
this physical register, or pass along the old value. In essence, the
predicated instruction must behave like the C select operation:

   xv2 = p1 ? (1 + t) : xv1

This requirement adds an additional input to all of the potentially predicated
instructions. It will increase the size of the front end of the processor
pipeline by 50%.

Control speculation

Control speculation refers to the compiler moving instructions above a branch.
This permits long-latency operations such as cache-missing loads to begin
earlier in the program. Consider the possible translations of if (a) x = *b+1
given in Figure 15. Column 1 presents a straightforward translation of the
statement. The other columns present versions of the code where the load is
moved above the branch.

   (1) Alpha:            (2) IA64:
       BNE a,.+3             LD4.S t=[b]
       LDL t,(b)             ...
       ADD t,1,x             CMP.EQ p1,p2=a,0
                             (p2) BR .+2
                             CHK.s t
                             ADD x=1,t

   (3) Alpha:            (4) Alpha:
       LDL t,(b)             LDL r31,(b)
       ...                   ...
       BNE a,.+2             BNE a,.+2
       CMP ER,EXC,t0         LDL t,(b)
       BNE t0, fixup         ADD t,1,x
       ADD t,1,x

Figure 15: Possible translations of if (a) x = *b+1.

Note that when an instruction is moved above a branch, it may now be executed
when it should not be. In the original program, the instruction would only be
executed when control reached the block containing the instruction, its home
block. The transformed program should continue to execute as if the
instruction were still in its home block, but with the performance advantage
of issuing the instruction earlier. The compiler must be careful about the
side effects the instruction has on registers, memory, and exceptions. For
registers, the compiler can simply ignore the written register if the program
branches away from the home block of the load. Memory side effects can be
avoided by not speculating stores. Avoiding exceptions requires more support.
Consider, in Figure 15, what happens if both a and b are zero. If we move the
load above the branch, we will get a memory exception for dereferencing
location zero; if we do not move the load, we will branch around it.

To address this problem, IA64 has introduced two new instructions: a
speculative load and a speculation check. On an exception, the speculative
load sets a 65th bit in the result register and ignores the exception.
The check instruction tests the 65th bit of the register and, if it is set,
signals the exception. This permits the exception to be deferred until control
reaches the home block of the load, or dismissed if control never reaches this
block. Column 2 in Figure 15 illustrates the use of the speculative load
(indicated by the .s qualifier) and the speculation check (CHK.s) instruction.

For Alpha, we have implemented a similar system in software [5]. Speculative
loads are identified using a PC-map. When a speculative load generates an
exception, the system software sets an assigned bit in a reserved exception
register and returns; this corresponds to setting the 65th bit in the result
register in the IA64 design. In the home block, a traditional compare
instruction checks whether the bit corresponding to this speculative load is
set in the exception register, and branches to a handler if necessary. Column
3 in Figure 15 illustrates the use of this scheme.

We have done extensive experimentation with software-directed control
speculation. We have discovered that an out-of-order processor with a large
instruction window can find most of the instruction-level parallelism in the
program if all memory references are cache hits; the problem is cache misses.
An attractive solution for improving instruction-level parallelism in programs
is therefore to use prefetching. We can ensure a reference is a cache hit by
prefetching it. A prefetch cannot generate an exception, so there is no need
for a checking instruction in the home block. In addition, a prefetch will
fetch a 64-byte cache line; multiple prefetches to the same cache line can be
merged into a single instruction. And a prefetch does not require a register
to hold a result, so register pressure is reduced. In summary, prefetching on
an out-of-order processor provides all of the benefits of control speculation
with fewer instructions, and thus a smaller instruction cache footprint.
Prefetching directly attacks memory latency, which is one of the major limits
to high instruction-level parallelism.
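As a sketch of what this looks like at the source level (ours, and hedged: the
prefetch() intrinsic below is hypothetical; on Alpha it would compile to a
load into R31, which cannot fault and produces no result), the list-walking
loop of Figure 6 becomes:

   struct node { struct node *next; int data; };

   extern void work(struct node *a);
   extern void prefetch(const void *p);   /* hypothetical intrinsic */

   void walk(struct node *a)
   {
       while (a != 0) {
           prefetch(a->next);   /* start the possible cache miss early */
           work(a);             /* overlaps with the fill of the next node */
           a = a->next;
       }
   }

No check instruction is needed in the loop body, and prefetching a null
pointer is harmless.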
Compilation techniques for exploiting control speculation require code
duplication. Superblocks (or hyperblocks) must be built to create
straight-line paths through the code [7,8]. Figure 16 shows a simple flow
graph where each Bn represents a basic block, and the edges are possible
control flow between the blocks. Assume the frequently traveled path through
the graph is B0-B1-B3-B5-B6. To isolate this path for control speculation, the
superblock algorithm will create copies of blocks B3, B5, and B6. If all
blocks are the same size, this is a 30% increase in code size. This increase
is on top of the 33% code size increase due to the IA64 instruction encoding.
IA64 will not be competitive on commercial applications, where code size has a
first-order effect on performance.

      flow graph                superblock

         B0                        B0
        /  \                      /  \
      B1    B2                  B1    B2
        \  /                    |     |
         B3                     B3    B3'
        /  \                   /  \  /  \
      B4    B5               B5    B4    B5'
        \  /                  |     \    /
         B6                  B6      B6'

Figure 16: Forming a superblock for the hot path B0-B1-B3-B5-B6 (primed
blocks are the off-trace copies).

Data speculation

Data speculation refers to the compiler moving a load above a possibly
conflicting store. Often in programs with pointers, distinct pointer
references rarely point to the same memory location, but the compiler cannot
prove this. IA64 provides some architectural support to permit a load to move
above a store and then, after the store occurs, check whether the two pointers
overlapped. Figure 17 illustrates the problem.

To improve the performance of the program, we would like to schedule the
potentially cache-missing load of *q above the store to *p. IA64 introduces
two new instructions to do this: an advanced load and a checking load; we can
see their use in column 2. The advanced load (indicated by the .a qualifier)
will both perform a load and create an entry in an advanced load table (ALAT).
Every store in an IA64 will check its address against the ALAT and clear any
matching entry. The checking load (indicated by the .c qualifier) will check
the ALAT. If the entry for the matching advanced load remains in the table,
then no intervening store to the same address has occurred, and the checking
load completes without reissuing the load. If the entry for the matching
advanced load has been removed from the table, a store to the same address may
have occurred, and the checking load reissues the load.

The same effect can be created on Alpha by explicitly checking whether the two
pointers are equal; see column 3 of Figure 17. If the two pointers are equal,
then we do not need to replay the load, but can move the value of the store to
the destination register of the load.

   *p = x
   y = *q

   (1) Alpha:            (2) IA64:
       STQ x,(p)             LD8.a y=[q]
       LDQ y,(q)             ...
                             ST8 [p]=x
                             LD8.c y=[q]

   (3) Alpha:            (4) Alpha:
       LDQ y,(q)             LDL r31,(q)
       ...                   ...
       STQ x,(p)             STQ x,(p)
       CMPEQ p,q,t           LDQ y,(q)
       CMOVNE t,x,y

Figure 17: Possible translations of a store to *p followed by a load from *q.

We invented a similar address-checking technique when Alpha was first designed
[1]. However, we determined that this problem is best solved by the processor.
At run-time, we know the pointer values and can track when conflicts occur and
how they vary over time. A run-time predictor [9] can accurately predict which
loads and stores conflict; it permits the processor to freely reorder the
references that are unlikely to conflict. We can achieve better performance
than IA64, because our run-time predictor enables us to determine more
accurately when it is profitable to move a load above a store. And we will not
require any extra checking instructions, so our instruction cache footprint
will be smaller. In cases where the load from *q frequently misses the cache,
it will be beneficial to add a prefetch instruction, as shown in column 4 of
Figure 17.
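To make the ALAT protocol concrete, here is a toy software model (ours, and
deliberately simplified: the real table is also indexed by register number and
tracks access sizes; all names here are illustrative):

   #include <stdint.h>

   #define ALAT_ENTRIES 32

   struct alat {
       uintptr_t addr[ALAT_ENTRIES];
       int       valid[ALAT_ENTRIES];
   };

   /* ld.a: perform the load and record its address. */
   void advanced_load(struct alat *t, int entry, uintptr_t addr)
   {
       t->addr[entry] = addr;
       t->valid[entry] = 1;
       /* ... the load itself happens here ... */
   }

   /* Every store snoops the table and clears matching entries. */
   void store_snoop(struct alat *t, uintptr_t addr)
   {
       for (int i = 0; i < ALAT_ENTRIES; i++)
           if (t->valid[i] && t->addr[i] == addr)
               t->valid[i] = 0;
   }

   /* ld.c: returns 1 if the entry was lost and the load must reissue. */
   int checking_load(const struct alat *t, int entry)
   {
       return !t->valid[entry];
   }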
Function calls

IA64 has added register windows, similar to those in the SPARC architecture,
to support function calls. The 128-entry general-purpose register file is
partitioned into a 32-entry global file and a 96-entry stacked file. On entry
to a procedure, an ALLOC instruction is executed that creates a new register
stack frame of up to 96 entries; on return, the caller's register stack frame
is restored. It appears to the compiler that there is an unlimited stack of
physical registers. Of course, there is a bounded number of physical registers
on the processor, and they are written out to memory as necessary by a
register stack engine (RSE), without explicit program intervention.

It is ironic that in an architecture designed to give the compiler control of
the processor, the compiler is not given control of saving and restoring
registers. This is one area where compiler technology has a proven record of
success [10]. Only a handful of registers need to be saved and restored across
a procedure call, and the saves and restores can typically be placed at
locations where they do not conflict with the other computation. On an IA64,
the register stack engine may be triggered at any time. Also note that it will
save all registers, not simply those that are live across a function call;
much of the saving and restoring done by the register stack engine will be
unnecessary.

To make effective use of the register stack, an IA64 implementation will need
a large on-chip physical register file that can hold multiple register frames.
The register file is one of the most valuable resources on the chip; it must
be carefully designed to deliver fast, multi-ported access. However, on an
IA64, most of the physical register file will sit idle, since only one window
on the register file can be open. In effect, the register file is a staging
buffer for saving and restoring registers to memory. Contrast this with a
multithreaded, out-of-order machine, where all the physical registers are
available to the current computation, and this very expensive structure is
usually fully utilized.

Software pipelining

IA64 has added significant architectural support for software pipelining. It
has introduced rotating register files and implicit predication to minimize
the code size of a software-pipelined loop. This is best understood by looking
at an example. In Figure 18 we present a simple loop that adds a constant to a
vector. Part 2 of the figure indicates how a single iteration of the loop
would execute on a simple machine with a 3-cycle cache-hit latency and a
3-cycle floating-point latency, assuming the arrays are in the data cache. If
we assume this machine can issue a load, a store, and a floating-point
operation in a cycle, we would like to execute an iteration of this loop every
cycle. To avoid register conflicts, this requires a three-cycle kernel, as
shown in part 3. To simplify the exposition, we do not show any loop control
or address arithmetic, and we use the Alpha instruction set.

The three-cycle kernel is the body of a loop, and it will iterate until the
loop is complete. Note that the LDT f0, 0(r16) in the first cycle is consumed
by the ADDT f0, f30, f16 in the next iteration of the loop. This kernel runs
the simple machine at peak rate, but we need to be able to enter it and exit
it. We create a prolog to initiate all the computations that must be in flight
to enter the kernel at the top, and we create an epilog to complete the
computations in flight when we finally exit the kernel at the bottom. Note
that the code has increased in size 9 times over the simple LDT/ADDT/STT that
is the basic body of the loop.

   1. simple loop

      DO i = 1,n
        A(i) = T+B(i)

   2. basic loop schedule

                           cycle
      LDT f0, 0(r16)       0
      .                    1
      .                    2
      ADDT f0, f30, f16    3
      .                    4
      .                    5
      STT f16, 0(r17)      6

   3. software pipelined kernel

      STT f16, 0(r17)   ADDT f0,f30,f16   LDT f0, 0(r16)
      STT f17, 8(r17)   ADDT f1,f30,f17   LDT f1, 8(r16)
      STT f18,16(r17)   ADDT f2,f30,f18   LDT f2,16(r16)

   4. loop with prolog and epilog

      prolog
      LDT f0, 0(r16)
      LDT f1, 8(r16)
      LDT f2,16(r16)
      ADDT f0,f30,f16   LDT f0, 0(r16)
      ADDT f1,f30,f17   LDT f1, 8(r16)
      ADDT f2,f30,f18   LDT f2,16(r16)

      kernel
      STT f16, 0(r17)   ADDT f0,f30,f16   LDT f0, 0(r16)
      STT f17, 8(r17)   ADDT f1,f30,f17   LDT f1, 8(r16)
      STT f18,16(r17)   ADDT f2,f30,f18   LDT f2,16(r16)

      epilog
      STT f16, 0(r17)   ADDT f0,f30,f16
      STT f17, 8(r17)   ADDT f1,f30,f17
      STT f18,16(r17)   ADDT f2,f30,f18
      STT f16, 0(r17)
      STT f17, 8(r17)
      STT f18,16(r17)

Figure 18: A software-pipelined kernel.

To address this code growth, IA64 has added two features. The first is
rotating registers. The kernel of the loop has three copies of the same code,
with different register identifiers.
By rotating the register file each iteration of the loop, so that f32 in the
current iteration will be f35 three iterations later, we can write the kernel
as a single copy. The second feature is rotating predicates. Note that the
prolog and epilog are identical to the main kernel, except that certain
operations are dropped out. On an IA64, we can achieve this same effect
dynamically with predication. For example, in the prolog, we start with the
full kernel, but only enable the LDTs, then the LDTs and the ADDTs, and then
finally the entire kernel. When we are done with the main body of the loop, we
enter the epilog by first turning off the LDTs, then turning off the ADDTs,
and finally the STTs, and we are done. The code growth savings are
considerable.
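A small model may help. Roughly (this is our simplification of the IA64
scheme, with illustrative names), a rotating register number is an offset from
a rotating register base (rrb) that the loop-closing branch decrements each
iteration, so the physical location written as f32 in one iteration is
addressed as f33 in the next:

   #define ROT_FIRST 32   /* first rotating FP register  */
   #define ROT_SIZE  96   /* size of the rotating region */

   /* Map architectural register f<reg> to a physical slot, given the
    * current rotating register base rrb (assumed kept in the range
    * (-ROT_SIZE, ROT_SIZE) by wraparound). Because the hardware
    * decrements rrb on each loop-closing branch, slot(32, rrb) in one
    * iteration equals slot(33, rrb - 1) in the next, and equals
    * slot(35, rrb - 3) three iterations later. */
   int slot(int reg, int rrb)
   {
       return (reg - ROT_FIRST + rrb + ROT_SIZE) % ROT_SIZE;
   }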
However, software pipelining is most applicable to high performance technical
computing. These applications are dominated by high trip-count loops, and even
with very large loop unrolling, the instruction cache miss rate is not
significant. IA64 will have smaller code than Alpha for these scientific
loops, but it will not affect performance. For commercial applications with
low trip-count loops, software pipelining is not the best solution. A low
trip-count loop will be dynamically unrolled in the instruction window of an
out-of-order processor. The processor will reorder the code, adjusting to
dynamic events such as cache misses, which are difficult to predict at
compile-time.

Instructions

Though most of the IA64 instructions are RISC-style, there are a few notable
exceptions.

No displacements. The basic IA64 addressing mode is a base address contained
in a register. A surprising weakness is that the base address cannot be
modified by a displacement when calculating a virtual address. A base-update
mode does permit a displacement to be added to the base register after the
virtual address has been computed. This forces an unattractive tradeoff when
accessing multiple fields off of the same base pointer. Either a register must
be dedicated to computing each field address, or one addressing register can
be used with base-update addressing, which introduces a data dependency
between the accesses. Figure 19 presents sequences for loading multiple fields
from a structure, as may happen in a large commercial application. In column
1, the Alpha sequence incorporates the appropriate offsets from p as
displacements in the load instructions, and no additional address arithmetic
is required. Column 2 presents the parallel IA64 code, where two explicit ADD4
instructions are required to compute the addresses of the fields before
loading them. Column 3 presents the sequential IA64 code, where the second LD8
is dependent on the first LD8 and must issue in a following cycle. The IA64
code is 33-66% larger than the Alpha sequence. The IA64 architecture is not
well suited to the addressing requirements of commercial applications.

   new_salary = p->salary + p->increase;

   (1) Alpha:            (2) IA64:             (3) IA64:
       LDQ t0, s(p)          ADD4 a0=p,s           ADD4 a0=p,s
       LDQ t1, i(p)          ADD4 a1=p,i           LD8 t0=[a0],i
       ADDQ t0,t1,ns         LD8 t0=[a0]           LD8 t1=[a0]
                             LD8 t1=[a1]           ADD8 ns=t0,t1
                             ADD8 ns=t0,t1

Figure 19: Addressing a structure with displacements.

CISC instructions. The pipeline for a modern RISC processor is optimized for
writing one register per instruction. IA64 has introduced many instructions
where a single instruction may require more than one register write. As
mentioned above, the base register in a memory reference can be updated after
the memory reference occurs by adding a register or immediate to the base
address. An IA64 load may therefore write two registers: the destination
register of the load and the base register. Two register writes will
complicate a pipeline optimized for writing one register per instruction. It
may also require an additional write port on the register file to handle a
worst case that rarely happens.

To support software pipelining, IA64 has introduced a very complicated branch
instruction. The loop branch instruction reads and writes 3 special
application registers (the rotating register base, the loop counter, and the
epilog counter), writes predicate register 63, and updates the PC.
Implementing this instruction will complicate a pipeline optimized for writing
one register per instruction. It is also tailored to a very specific style of
software pipelining and cannot be used by compilers that support a different
model.

5. Application performance

In developing the Alpha architecture, we focus on the performance of
general-purpose code, on high performance technical computing, and on
commercial server applications. Of course, application performance is
determined by the computer system, not the microprocessor alone. To improve
system performance, we are incorporating some additional system functionality
onto the processor.

- Low-latency, high-bandwidth memory interface. The microprocessor core is
  increasing in speed relative to memory. The major bottleneck in system
  performance is the memory interface. Future Alpha systems will bring the
  board-level cache and the memory controller on-chip, to reduce latency to
  memory and to increase bandwidth.

- Distributed shared memory (DSM). Bus-based multiprocessor systems do not
  scale well; in the near future, the memory requests of a single processor
  will exhaust the capacity of the system bus. Future multiprocessor systems
  will require distributed shared memory, with local memories for each
  processor and point-to-point connections between processors and remote
  memory. Future Alpha processors will directly support this organization
  with on-chip network interfaces.

General purpose code

                                         SPECint   speedup
                                         -------   -------
   in-order      Mips R5000    180MHz      4.82
   out-of-order  Mips R10000   180MHz      8.59      1.78

   in-order      Pentium       200MHz      5.47
   out-of-order  PentiumPro    200MHz      8.09      1.48

   in-order      Alpha 21164   600MHz     19.3
   out-of-order  Alpha 21264   575MHz     30.3       1.57

Figure 20: SPECint comparison of in-order and out-of-order processors at the
same cycle time.

Out-of-order execution is the most effective way to improve the performance of
general-purpose code. General-purpose codes are those that are not highly
parallel and do not put extraordinary demands on the memory system of the
processor; they are exemplified by the SPEC integer benchmarks. Out-of-order
execution has been implemented in a number of microprocessors in the past 5
years, and we can examine its benefit by comparing the performance of in-order
and out-of-order implementations of the same instruction set architecture. We
can see that at the same cycle time, we typically get a performance
improvement of about 1.5x.

High Performance Technical Computing

Highly parallel, bandwidth-intensive programs characterize high performance
technical computing (HPTC). There is a large amount of parallelism, both at
the instruction level and at the thread level. These applications map well
onto a simultaneous multithreaded processor.
The performance of HPTC applications is typically limited by the memory bandwidth of the system. Most large applications require loading one double-precision floating-point number from memory for each flop; this is 8 bytes/flop. The SPECfp95 benchmarks are smaller versions of HPTC applications. Performance studies on previous Alpha implementations have shown that the SPECfp95 benchmarks spend 50% of their time stalled waiting on a data cache miss [11]; essentially no time is spent waiting on an instruction cache miss. The next generation Alpha processors will have the fastest memory system in the industry. We will be the leaders in HPTC performance.

Server applications

Server applications such as online transaction processing and web serving are large multithreaded programs. They have very small amounts of instruction-level parallelism; on the most aggressive superscalar machines they typically execute less than one instruction per cycle. However, server applications perform well on multiprocessor platforms, and they are ideally suited for a simultaneous multithreaded processor. As an example of a server application, we will consider online transaction processing. This application has a number of distinctive characteristics:

- No loops. Our studies of databases have shown that almost 70% of the run-time is spent in procedures whose loops iterate fewer than 2 times on average, and that nearly all run-time is spent in procedures whose loops iterate fewer than 16 times [12]. An insignificant amount of time is spent in procedures containing loops with high trip counts.

- Large routines with many branches. Almost all the run-time in a database is spent in routines with more than 16 distinct branches, and more than half the time is spent in routines with more than 128 branches [12]. Even though there are many branches, they are well suited to a hardware branch predictor. Our studies have shown that the misprediction rate running TPC-C is less than the average for the SPECint95 benchmarks [11].

- Large memory footprint. Transaction processing has a large memory footprint, exceeding the size of on-chip caches and large board-level caches. For TPC-C, the overall cache miss rate is three times the average of the SPECint95 benchmarks, and more than twice the average of the SPECfp95 benchmarks. The instruction cache miss rate is four times the average of the SPECint benchmarks and 10 times the average of the SPECfp benchmarks. The board-level cache miss rate is 10 times the average of the SPECint benchmarks and 1.5 times the average of the SPECfp benchmarks [11].

On an earlier Alpha system, the processor was stalled 80% of the time running TPC, while delivering industry-leading performance [11]. To improve performance, we are designing future Alpha systems to do the following:

Reduce the instruction cache footprint. The major goal when compiling databases is to reduce the size of the instruction cache footprint. Improving code quality to generate fewer instructions is important. Code layout techniques that rearrange the instructions in the application to make the more frequent path sequential will effectively pack each cache block with useful instructions. This packing improves performance by up to 30%.
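At the source level, this kind of layout is what profile feedback or branch hints buy the compiler. A minimal sketch, assuming GCC/Clang (the __builtin_expect hint and the function below are invented for illustration, not from this document): with the hint, the compiler keeps the frequent path as straight-line code and moves the cold path out of line, so each fetched cache block is packed with useful instructions.

    /* Invented illustration of hot/cold code layout; not from this document. */
    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* The error path is rarely taken; with the hint, the compiler can place
     * it after the hot code (or in a cold section), keeping the frequent
     * path sequential in the instruction cache. */
    int clear_buffer(int fd, char *buf, int n) {
        if (unlikely(fd < 0)) {
            fprintf(stderr, "bad descriptor\n");   /* cold path */
            return -1;
        }
        for (int i = 0; i < n; i++)                /* hot path */
            buf[i] = 0;
        return 0;
    }

    int main(void) {
        char buf[16];
        return clear_buffer(1, buf, 16) == 0 ? 0 : 1;
    }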
The basic instruction encoding of IA64 increases the code size by 33%. In addition, the techniques that IA64 introduces to increase instruction-level parallelism (speculation and predication) increase the instruction cache footprint. Speculation requires that additional check instructions be introduced on the frequent path for each speculative chain of operations; large amounts of code must be introduced to recover from mis-speculation; and large amounts of code must be copied to form superblocks. Predication requires that both sides of a conditional branch be in the instruction cache. Alpha's out-of-order execution techniques are a much better method for increasing instruction-level parallelism without increasing the instruction cache footprint. Also note that the special IA64 instructions introduced for software pipelining loops are not applicable to transaction processing, since there are no frequently executed loops.

Tolerate cache misses. With out-of-order execution, an Alpha processor can adapt to a data cache miss; only the instructions that are dependent on the missing reference are delayed. In an in-order processor, all of the instructions after the first instruction dependent on the missing reference are delayed. The IA64 strategy for tolerating cache misses is to guess at compile-time which references will miss, and to schedule speculative loads to issue those references early (a rough source-level sketch follows below). This approach has three drawbacks. First, in-line checks and compensation code are required, and the additional instruction cache misses due to these instructions may outweigh any gains from issuing the load early. Second, the address computation of the load is often on the critical path, so the load cannot be scheduled earlier. Finally, in a program like transaction processing, with irregular data structures, it is very difficult for the compiler to predict which references will miss. The out-of-order processor has this information at run-time to guide its dynamic scheduler.
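IA64's actual mechanism is speculative loads paired with explicit check instructions, which has no direct C equivalent; the closest portable analogue to "issue the reference early on a compile-time guess" is a software prefetch hint. A minimal sketch, assuming GCC/Clang's __builtin_prefetch (the list structure and function are invented for illustration): if the compile-time guess is wrong, the extra instructions are pure overhead, which is exactly the drawback noted above.

    /* Invented illustration of compile-time miss guessing; not from this document. */
    #include <stdio.h>

    /* Pointer-chasing list: the irregular data structure typical of
     * transaction-processing code, where miss behavior is hard to predict. */
    struct node { struct node *next; int key; };

    int sum_keys(const struct node *p) {
        int sum = 0;
        while (p) {
            /* Compile-time guess: assume the node after next will miss, and
             * issue its reference early.  A prefetch hint never faults, so
             * the address may be speculative. */
            if (p->next)
                __builtin_prefetch(p->next->next, /*rw=*/0, /*locality=*/1);
            sum += p->key;
            p = p->next;
        }
        return sum;
    }

    int main(void) {
        struct node n3 = { NULL, 3 }, n2 = { &n3, 2 }, n1 = { &n2, 1 };
        printf("%d\n", sum_keys(&n1));   /* prints 6 */
        return 0;
    }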
Increase effective pin bandwidth. To increase the effective pin bandwidth, we are moving the board-level cache onto the chip. The interface between the processor and the board-level cache is the most frequently traversed interface in today's systems. By moving the cache onto the chip, we can introduce a much higher-bandwidth, lower-latency interface between the processor and the cache. At the same time, we will make the cache highly associative, effectively removing all conflict misses. In addition, we are adding a direct RAMbus memory controller to the processor. RAMbus offers four times the effective pin bandwidth of a conventional SDRAM memory system. Future Alpha systems will have the highest-bandwidth, lowest-latency memory systems in the industry. Merced will not have an on-chip level 2 cache, and neither Merced nor McKinley will have an on-chip memory controller [13].

Increase processor-to-processor bandwidth. The scaling of current systems running transaction processing is limited by the bandwidth of the system bus; in particular, communication of shared data is a bottleneck. Future Alpha systems will replace the system bus with many point-to-point connections, removing the system bus as a bottleneck. Neither Merced nor McKinley will have on-chip support for distributed shared memory [13].

Simultaneous multithreading. Transaction processing is an explicitly parallel program with very low functional unit utilization; detailed simulations [14] have shown that a speedup of 3x is achievable with simultaneous multithreading. This remarkable result is due to two phenomena. First, even though the threads are not running in lock step, they are all running the same code, and by constructive interference the effective instruction cache miss rate of the program is reduced by 30%. Second, with appropriate support from the operating system, multithreading introduces very few conflict misses into the data references, and the latency tolerance inherent in SMT permits the extra data cache misses introduced by the multiple threads to have a minimal effect on processor performance. For a processor with sufficient memory bandwidth, such as the next generation Alpha, simultaneous multithreading can effectively exploit the parallelism in transaction processing.

IA64 does not include simultaneous multithreading. Each thread in a transaction-processing program is very sequential and includes long delays waiting for memory. The IA64 strategy of searching for instruction-level parallelism cannot find the orders-of-magnitude improvements available through simultaneous multithreading.

In summary, future Alpha systems will directly address the performance bottlenecks in transaction processing. We expect the transaction-processing throughput of an Alpha processor to increase by a factor of 10 from today's 21264 to an 8-wide 21464 with simultaneous multithreading: a factor of 2 from improved instruction-level parallelism, a factor of 2 from thread-level parallelism, and a factor of 2.5 from improvements in process technology and cycle time. In contrast, the new technologies Intel has developed for IA64 are a poor fit for transaction processing. Their new instruction set architecture increases the code size of a program by at least 33%, and the compiler techniques required to increase instruction-level parallelism introduce additional instructions. IA64 does not have techniques for dynamically tolerating cache misses or memory latency. In fact, the IA64 techniques for tolerating memory latency create additional instructions, which will increase the instruction cache miss rate, which in turn will increase the memory system delays. Merced will add no features to increase effective pin bandwidth, and no IA64 processor is planning to increase processor-to-processor interconnect bandwidth. Most importantly, IA64 does not include simultaneous multithreading. IA64 is focused only on increasing instruction-level parallelism, and IA64 processors have no techniques to effectively exploit the parallelism in an explicitly parallel program. Figure 21 highlights these differences.

                                      Merced   21364   McKinley   21464
    small icache footprint              no      yes       no       yes
    dynamically tolerate
      cache misses                      no      yes       no       yes
    increase pin bandwidth
      . on-chip L2 cache                no      yes      yes       yes
      . on-chip memory controller       no      yes       no       yes
    increase processor-to-
      processor bandwidth               no      yes       no       yes
    simultaneous multithreading         no      no        no       yes

Figure 21: Features for supporting high performance commercial applications.

5. Conclusion

An Alpha processor will be able to exploit static instruction-level parallelism (discovered by the compiler at compile-time) and dynamic instruction-level parallelism (discovered by the processor at run-time). An IA64 processor will only be able to exploit static instruction-level parallelism. An Alpha processor can take advantage of the excellent compiler technology developed for IA64 and other VLIW processors; much of this technology is already implemented in the Alpha compilers.
However, the Alpha compilers will be able to use these optimizations much more judiciously, avoiding excessive code growth, because the Alpha out-of-order processor can also discover instruction-level parallelism at run-time.

An Alpha processor will be able to adapt to dynamic program behavior at run-time. An IA64 processor will not. An Alpha processor can adapt to memory references that miss in the cache, avoiding delays of 100 cycles or more; an IA64 processor will stall. An Alpha processor can find instruction-level parallelism when the compiler does not express it, and it can find instruction-level parallelism at run-time across branches, function calls, and compilation boundaries.

An Alpha processor will be able to exploit thread-level parallelism. An IA64 processor will not. Most server applications are divided into multiple threads, and simultaneous multithreading permits these applications to take full advantage of the multiple execution units on the processor. An Alpha processor can use thread-level parallelism and instruction-level parallelism interchangeably, adjusting to the behavior of the application. Amdahl's law says that high performance requires speedups in both the sequential and the explicitly parallel portions of an application; an Alpha processor can deliver these speedups.

An Alpha processor will deliver the highest memory bandwidth in the industry, and systems built out of Alpha processors will lead the industry in high performance technical computing.

An Alpha processor will significantly outperform an IA64 processor on commercial applications. Alpha processors address the main requirements of commercial applications: reducing the instruction cache footprint, tolerating unpredictable cache misses, increasing the pin bandwidth, and exploiting explicit thread-level parallelism. IA64 processors are not well designed for commercial applications: they require a large instruction cache footprint, they cannot dynamically tolerate cache misses, and they cannot exploit thread-level parallelism. In the important server markets, Alpha will outperform IA64.

Bibliography

1. McKeen, Adler, Emer, Lowney, Nix, Sager. "Mechanism for enforcing the correct order of instruction execution," US patent 5,420,990.
2. McKeen, Adler, Emer, Lowney, Nix, Sager. "Apparatus and method for speculatively executing instructions in a computer system," US patent 5,421,022.
3. McKeen, Adler, Emer, Lowney, Nix, Sager. "Method and apparatus for propagating exception conditions of a computer system," US patent 5,428,807.
4. Adler, Lowney, Hobbs. "Software mechanism for accurately handling exceptions generated by instructions scheduled speculatively due to branch elimination," US patent 5,627,981.
5. Adler, Lowney, Hobbs. "Software mechanism for accurately handling exceptions generated by speculatively scheduled instructions," US patent 5,634,023.
6. Cohn, Adler, Lowney. "Software mechanism for reducing exceptions generated by speculatively scheduled instructions," US patent 5,901,308.
7. W. Hwu, et al., "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," The Journal of Supercomputing, 7(1/2):229-248, 1993.
8. S. Mahlke, et al., "Effective compiler support for predicated execution using the hyperblock," in Proc. of the 25th Annual Intl. Symp. on Microarchitecture, Dec. 1992.
9. G. Chrysos, et al., "Memory Dependence Prediction using Store Sets," in Proc. 25th International Symposium on Computer Architecture, pp. 142-153, Barcelona, Spain, June 1998.
10. D. Wall, "Register Windows vs. Register Allocation," in Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation '88, pp. 67-78, Atlanta, GA, June 1988.
11. Z. Cvetanovic, et al., "AlphaServer 4100 Performance Characterization," Digital Technical Journal, 8(4):3-20, 1996.
12. R. Cohn, et al., "Optimizing Alpha Executables on Windows NT with Spike," Digital Technical Journal, 9(4):3-20, 1997.
13. L. Gwennap. Intel's Merced and IA-64: Technology and Market Forecast, 1999 Edition. MicroDesign Resources, 1999.
14. J. Lo, et al., "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors," in Proc. 25th International Symposium on Computer Architecture, pp. 39-50, Barcelona, Spain, June 1998.