IA-64 and Merced--What and Why
Everyone Talks About Merced, But No One's Doing Anything About It
by Peter Christy

It's very weird. 1997 may well be the year of first silicon for what is likely to be the most significant new microprocessor family the world has seen in a long time: Merced, the first implementation of the Intel/HP architecture collaboration called IA-64. Yet we hear almost nothing about IA-64 or Merced. From the computer-user side, you might expect more curiosity about what these new processors might do to the performance curve. From the software and system side, you might expect a lot more discussion about how disruptive the evolution from the x86 family will be.

It is easy to understand why things are so quiet--Intel makes silence a condition of learning about IA-64 and Merced. Intel is neither stupid nor mean-spirited; it just sees more liability than benefit from such discussions at this time. We're sympathetic to Intel's concerns, but we feel a broad IA-64 discussion is appropriate among the community that will feel the impact. The development of IA-64 signals a fundamental shift; it's the first of several next-generation architectures designed to compensate for the complexity of today's CPUs with tighter links between a smarter, more advanced compiler and a simpler, faster microarchitecture.

The Potential Value of a New Architecture

A new architecture can redefine the performance roadmap, and the name of this game is performance. Today's highly superscalar microprocessors use an ever-decreasing percentage of their transistors to do useful work. Of the millions of transistors in a modern CPU, remarkably few directly contribute to adding numbers or moving data under the direction of the programs being executed. Far too many are spent rearranging instructions and keeping track of their original order. An architecture that reversed this trend and invested more in real application performance would be a significant change. There are enough transistors available to implement many more functional units than CPUs have today, if only we could put them to productive use.

An increasing percentage of the logic on a microprocessor (the CPU less the cache) is dedicated to management functions such as control of out-of-order and speculative execution. Such logic is difficult to design and largely implementation specific, increasing the complexity and time-to-market of the processors. This logic also tends to span a lot of die area, creating speed-limiting paths. A new architecture could greatly reduce the investment in these overhead functions.

The dominant consumer of transistors in microprocessors is cache memory. Cache is a good and easy way to use silicon, but at best, a cache lets the CPU work at speed; it does not by itself do useful work. Architectural enhancements can make cache fundamentally more effective.

Looking at Intel's problems specifically, a new architecture could fix the most glaring problems in IA-32 (x86), including its 32-bit address limitations and the awkward floating-point stack architecture.

Although we're not yet privy to any of the specific details of IA-64, we believe that the new architecture attacks all of these problems. In short, we think an innovative architecture can do more useful work per clock cycle and run at faster clock rates, thereby getting onto a new and better performance curve.
Any processor that redefines performance expectations (Digital's Alpha, for example) will have a significant impact. Such a processor from the dominant market leader will have a profound impact.

Is IA-64 RISC, VLIW, or Something Else?

We tend to think of IA-64 as a modern-day RISC machine in some sense. The earliest view of RISC, from John Cocke in IBM's 801 design, was based on the idea of trading compile-time optimization against CPU complexity. Instructions in the 801 were simple and chosen for their ability to be implemented in high-speed logic. A large register set replaced complex addressing modes. Delayed branch instructions reflected the nature of branch processing and let a compiler optimize for it. The term RISC was coined after the nature of the instructions. For lack of a better term, I'll use RISC here, but much more in the sense of Cocke's original concept: rely on the compiler. The transition to next-generation architectures such as IA-64 can be understood as another transfer of complexity from the hardware to the compiler.

The early commercial RISC processors (the 801 was never brought to market) took advantage of simplified instructions to pipeline instruction execution and thus increase performance. (Contrary to popular belief, RISC chips did not have faster clock rates than contemporaneous CISC chips; they just introduced pipelining sooner.) Pipelined execution more than made up for the decrease in "power" of the simplified instructions. Today's RISC--and CISC--microprocessors have become complex again, so it's time for another dose of the same medicine: more compile-time optimization and simpler hardware.

We expect IA-64 to include more compile-time instruction sequencing. For this reason, some will call it a VLIW (very long instruction word) machine. Today's superscalar microprocessors have extensive logic to detect parallelism in the instruction stream and initiate concurrent operations when possible. We expect IA-64 to pass much of this burden back to the compiler. To a much greater extent than today, the CPU will just execute the instruction stream it is given as fast as it can, knowing the instruction interactions are safe because the compiler made it so.

The IA-64 architecture is also likely to include predicated execution and a larger register set. Today's microprocessors work hard to unearth data dependencies (i.e., whether the contents of a register are stable or still subject to unfinished calculations). With more registers, a compiler can use different register subsets to support execution on multiple branch paths, reducing the need to analyze dependencies because there is less need to reuse registers. To get the performance benefit of multi-issue designs without the complexity of dynamic control logic, the compiler needs the ability to create parallel branch-path logic. That means some enhancements to the notion of condition codes, along with the use of earlier test conditions to control execution (predicated execution), are needed. With a smart enough compiler, the result is the same as today's out-of-order or speculative execution, except that the mechanism is moved from the CPU to the compiler. The CPU logic gets simpler, and a source of slow paths is removed. The processor runs faster and does more work.
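To make the predication idea concrete, here is a minimal sketch in C of the kind of transformation we expect an IA-64 compiler to perform automatically. The branch-free form is purely illustrative--no IA-64 opcodes or predicate registers have been disclosed--but it shows how computing both paths and letting a one-bit "predicate" select the result removes the branch that today's hardware must predict.

    /* Branching version: the CPU must predict the branch and may discard
       speculative work when it guesses wrong. */
    int max_branch(int a, int b)
    {
        if (a > b)
            return a;
        else
            return b;
    }

    /* "Predicated" version: both candidate results are computed
       unconditionally, and a one-bit predicate selects between them.
       A predicated machine does the equivalent in hardware: each
       instruction carries a predicate, and instructions whose predicate
       is false are simply discarded, so there is no branch to mispredict. */
    int max_predicated(int a, int b)
    {
        int p = (a > b);             /* one compare sets the "predicate" */
        return p * a + (1 - p) * b;  /* both paths execute; p picks the result */
    }

A real compiler would use a conditional-move or predicate mechanism rather than arithmetic, but the effect is the same: straight-line code that multiple function units can execute without waiting on a branch.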
Changing Roles for Cache and Compiler

Better prefetch, store, and branch hinting are also likely. Ideally, the compiler and processor will cooperate to anticipate the use of data, so the cache can be managed most effectively, and to anticipate the flow of the program, minimizing processor stalls due to unexpected branches.

At the time the program is compiled, the compiler can develop a comprehensive model of the program's logic and flow (such a model is used today for comprehensive optimization). The compiler can determine quite a bit from the logic of the program, and even more if the language supports advisories (e.g., pragmas) from the programmer regarding the program's expected behavior. With today's instruction sets, however, the compiler has only limited ability to use this information in a way that speeds execution. Much of the analysis is thrown away and unavailable to the CPU.

If the compiler can generate hinting instructions, compiler/processor efficiency can be improved. Based on its flow analysis, the compiler can annotate a load or store with a measure of urgency ("we will need this data soon" or "if you have load bandwidth available, get this data"). Similarly, the compiler can generate a compile-time view of branch behavior and pass this data to the CPU to preload the branch-prediction tables.

There are more than enough transistors to provide many more function units on a microprocessor than we see today. Their utilization can be scheduled at compile time if there is enough parallelism in a program to take advantage of the hardware.

In summary, today's instruction sets make it difficult for the compiler to advise the CPU on what is likely to happen, so processor designers add lots of clever but bulky logic to determine program behavior on the fly. IA-64 has much room to improve this compiler/processor cooperation.
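As a concrete illustration of the load-urgency and branch hints described above, the sketch below uses GCC's __builtin_prefetch and __builtin_expect intrinsics as stand-ins; IA-64's actual hint instructions have not been disclosed, and the look-ahead distance here is a made-up tuning parameter.

    /* Compiler-style hinting, sketched with GCC intrinsics standing in for
       whatever hint instructions IA-64 actually provides. */

    #define PREFETCH_AHEAD 16        /* hypothetical look-ahead distance */

    long sum_array(const long *a, long n)
    {
        long i, sum = 0;
        for (i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                /* "We will need this data soon": a read prefetch with
                   high temporal locality (third argument, 0..3). */
                __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
            sum += a[i];
        }
        return sum;
    }

    /* A compile-time branch "prediction" can be expressed the same way;
       here the compiler tells the hardware the error case is unlikely,
       much as preloaded branch-prediction state would. */
    long checked_sum(const long *a, long n)
    {
        if (__builtin_expect(a == 0, 0))
            return -1;               /* unlikely error path */
        return sum_array(a, n);
    }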
Many IA-64 Ideas Are Not New

Many of the ideas in IA-64 have been used already, but never in a mainstream microprocessor, and certainly not in a product from a market leader. We expect IA-64 to take full advantage of all that has been learned about computer architecture, but many of the key concepts aren't really new.

Similar architectures were developed in the 1980s. Cydrome and Multiflow, among others, sold VLIW minicomputers (see MPR 2/14/94, p. 18). The clock speed of these minicomputers was limited by the low-integration component technology and by the large physical size of the machines. VLIW machines promised more work per clock cycle, and thus greater execution power, than more conventional designs of the same era. These machines worked, but soon minicomputers as a whole started to fade away, in part because they couldn't keep up with VLSI CPU performance progress and because small firms couldn't afford to produce VLSI versions of their unique architectures.

These early VLIW machines also suffered because they pushed the limits of compiler technology; compiling code for them could be painfully slow. It will take at least as much work to compile a program efficiently for IA-64 as it did for a Cydrome or Multiflow computer, but the compiler will be running on a CPU at least 100 times faster, so we don't expect compile time to be a particular hurdle.

Parallelism Provides High Value

Microprocessor performance increases with clock speed and with the amount of work done per cycle. We've already discussed how IA-64 helps increase clock speed and makes caches more effective. We also believe IA-64 will bend the price/performance curve by permitting more internal parallel execution.

Microprocessors already benefit from parallelism. Many CPUs dynamically schedule multiple function units based on observed data dependencies in the instruction stream. This multiple-issue mechanism clearly yields incremental performance up to a point, but it's not obvious that going beyond today's four-way issue is worth the effort. We believe IA-64 will enable explicit parallelism well beyond simultaneous four-way execution, although not with existing binaries. That parallelism will come from compilers that more fully extract parallelism from source code, from explicit parallel notations in programming languages and new compilers for those languages, and from explicitly parallelized programs.

Today's programming languages reflect the nature of conventional computers. An intrinsically parallel problem (e.g., matrix operations or the processing of a multimedia data stream) is represented as a cascade of iterated loops, with simple, scalar computations at the center of the loops. Smart compilers for machines with parallel hardware (e.g., supercomputers) work hard to abstract these loops back into the parallel operations they represent in order to generate efficient code for parallel or vector hardware. But as supercomputer vendors learned, it's a lot easier to generate good parallel code by starting with programs in which the parallelism is clearly expressed, rather than obscured by the programmer in iterative loops that force the compiler to rediscover it. The same wisdom will apply to IA-64: the programs that benefit most from its parallel capabilities will be the ones written to make the parallelism clear. This will often require some thought and effort--effort that will be rewarded with exceptional performance.

Intel and HP will provide excellent support for existing binary code through some combination of direct hardware support and software preprocessing. We don't expect these binaries to take much advantage of additional IA-64 parallelism. Source code that is recompiled to run natively on IA-64 should do considerably better, but we expect some explicit parallelization of algorithms will be needed to take full advantage of IA-64's parallel capabilities.

The MMX additions to IA-32 (x86) are similar. MMX is a parallel-instruction enhancement limited to operations suitable for media processing. I believe the impact of MMX will be larger than many expect: few programs have obvious MMX needs, but the potential for parallelism in media algorithms is greater than it first appears. MMX can accelerate printer drivers and rendering engines as well as MPEG decoding. IA-64's parallelism will support a wider set of data types and be more broadly usable than MMX, appearing in data-access and searching applications as well. We don't expect such parallel processing to accelerate most programs, but we do expect it to accelerate many of those that need it the most.
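To show why media code parallelizes so readily, here is a small C sketch of the saturating byte add at the heart of image blending and MPEG decoding. The scalar loop is how such code is typically written today; the trailing comment notes how an MMX packed instruction (or IA-64 function units scheduled by the compiler) can process many elements at once. The function name is ours, not from any announced instruction set.

    /* Saturating 8-bit add, the kind of inner loop found in image blending
       and MPEG decoding.  Written this way, the parallelism is real but
       buried inside the iteration. */
    void add_saturate(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            int sum = a[i] + b[i];
            dst[i] = (sum > 255) ? 255 : (unsigned char)sum;
        }
    }

    /* MMX performs eight of these saturating adds in a single PADDUSB
       instruction, so a compiler (or hand coder) that recognizes the
       pattern can retire eight iterations per packed operation.  A wider,
       more general IA-64 machine should let the compiler schedule several
       such independent operations in the same cycle. */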
Running Existing Binaries on IA-64

IA-64 will execute x86 and PA-RISC binaries faithfully. HP is expert at moving binary code forward transparently, given its experience moving from two CISC architectures to PA-RISC. Intel and Microsoft have learned a lot about maintaining compatibility with existing x86 binaries as their CPU and OS designs have evolved. Apple's emulation of 68K code on the PowerPC and Digital's clever FX!32 emulation clearly demonstrate the value and practicality of supporting existing binaries.

The question of performance on existing code is an interesting one. When Apple introduced its PowerPC Macs with 68K emulation, the PowerPC was so much faster than the 68K that emulated programs ran about as quickly as they had on a fast 68K, and that was more than fast enough for Apple's customers. The same should hold for Merced: although x86 code will present a tougher emulation challenge than 68K code did, improvements in emulation technology should let almost all existing binaries run as fast on Merced as they did on the x86 processors for which they were written.

Merced will have the additional benefit of being a mainstream processor. Digital had to develop FX!32 from scratch because there isn't a great incentive for developers to recompile Windows applications to run natively on Alpha. As IA-64 becomes broadly deployed by Intel, the incentive to produce native versions will be much higher.

HP will be concerned with supporting PA-RISC binaries and with making x86/Windows code run. HP has a long and distinguished history of assuring binary compatibility as generations move forward, so it knows how to take care of existing customers. The ability of IA-64 to provide good x86 execution is just icing on the cake for HP.

IA-64 Has Limitations

Despite the innovations and improvements in IA-64, it won't sweep away all competitors instantly. Intel has an institutionalized way of introducing new processors at the top of the price and performance range and then letting them trickle down over time. We expect the introduction of IA-64 to be no different: Merced will be big and expensive but will get smaller and cheaper over time. It will take time for the software industry to understand how to take full advantage of the new capabilities; it always does. And for most people, a low-cost x86 PC will be more than enough by the time IA-64 is introduced.

These performance improvements will probably come with some penalty in code density, as was the case for RISC compared with CISC. Where code density is a key issue (e.g., ROM-based applications or very inexpensive computers), the acceptance of IA-64 will be slow.

Finally, IA-64 programs will probably have to be reoptimized for each different implementation of the architecture, continuing the industry's drift toward processor-specific binaries and away from the universal instruction encodings that the System/360 and the VAX promoted throughout a processor family. The technical issues this raises can be handled transparently (for instance, by using a distribution format that is optimized for a specific processor at the time the software is loaded), but it will require additional system technology to manage.

Learning from the Past, Designing for the Future

Merced and IA-64 are coming, and their impact will be profound. There is little doubt that significant architectural advances can and will occur. IA-64 processors will get more work done on each clock cycle through increased on-chip parallelism and will run at faster clock speeds because of reduced complexity.

It seems the prevailing wisdom regarding microprocessor architecture is like a pendulum, swinging first one way and then another. Once-simple designs have become ever more complex. The advent of IA-64 heralds the beginning of the swing back, shifting complexity to the compiler to create a simpler and faster microprocessor.

Copyright © 1997 MicroDesign Resources