

Pentium 4: In Depth
Copyright (C) 2000 by Darek Mihocka
President and Founder, Emulators Inc.
Updated December 30, 2000


Introduction

How Intel Blew It

Processor Basics

Limitations of the Pentium III

Pentium 4 - Generation 7 or complete stupidity?

Analyzing the results

Updates


Introduction

According to Gateway's web site, the Pentium 4 is "the most powerful processor available for your PC". Unfortunately for most computer users, it's simply not true.

Despite a huge pavilion at COMDEX Las Vegas last month, Intel is almost mute when it comes to this "most powerful processor". Instead, Intel has been insulting television viewers all through this Christmas shopping season with its blue guy commercials pimping an almost 2-year-old Pentium III processor instead of its new flagship processor. Why? Could there be... problems?

Merry Christmas, Happy 21st Century, and brace for impact! The PC industry is taking a huge leap backwards as Intel's new flagship Pentium 4 processor turns out to be an engineering disaster that will hurt both consumers and computer manufacturers for some time to come. Effects of Intel's heavily delayed Pentium 4 release and this summer's aborted high-end Pentium III release are already being felt, with sharp drops in PC sales this season and migration to competing AMD Athlon based systems. Intel, and Intel-exclusive vendors such as DELL, have already suffered crippling drops in stock price due to Intel's processor woes, with each company's stock falling well over 50% in the past few months and hitting yearly lows this month. Don't say I didn't warn you folks about this back in September; I did.

As has been confirmed by other independent sources such as Tom's Hardware and by Intel engineers themselves at http://www.eet.com/story/OEG20001213S0045, and as a month of close study of the chip reveals, the Pentium 4 does not live up to its speed claims, loses to months-old AMD Athlon processors, and lacks some of the crucial features originally designed into the Pentium 4 spec. The only claim the chip does live up to is that it is based on a new, redesigned architecture - something Intel does every few years, but usually to increase speed. The new architecture has serious, fatal flaws that in some cases can throttle a 1.5 GHz Pentium 4 chip down to the equivalent speed of a mere 200 MHz Pentium MMX chip of 4 years ago - slower than any Celeron, Pentium II, or Pentium III chip ever released! It's a huge setback for Intel.

As Tom's Hardware points out, in some cases a massive rewrite of the code does put the Pentium 4 on top, barely, over the Athlon, but for most "bread and butter" code it loses to the Athlon. As I've found out this past month in rewriting both FUSION PC and SoftMac 2000 code, no amount of code rewriting can make up for the simple fact that the Pentium 4 contains serious defects in design and implementation. Other developers who have followed Intel's architectures and optimization guidelines and optimized their code for the Pentium II and III will also find that no amount of rewriting will make their code run faster on the Pentium 4 than it currently runs on the Pentium III. And this is not the fault of the developer. In cutting corners to rush the Pentium 4 out as soon as possible, Intel made numerous cuts to the design to reduce transistor counts, reduce die size, reduce manufacturing costs, and thus get a product out the door. And in the process it crippled the chip.

What's worse, popular compiler tools, such as Microsoft's Visual C++ 6.0, are still producing code optimized for the obsolete 486 and classic Pentium processors. They haven't even caught up to Pentium III levels yet. Since most developers do not write in low-level machine language as I do, most Windows software released for the next year or two will not be Pentium 4 (or even Pentium III) optimized. Far from it. Since Microsoft traditionally takes about 3 to 5 years from the release of a processor to the time when its compiler tools are optimized for that processor, it will be a long wait before Intel and Microsoft get things right.

The good news is that the AMD Athlon processor is still the fastest x86 processor on the planet and works around many of the problems in Microsoft's compilers and in Intel's flawed Pentium III and Pentium 4 designs. If you weren't an AMD fan before, you will be after you read what I have to say.

What happened? In an attempt to regain the coveted PC processor speed crown, which the Pentium III lost to the AMD Athlon in late 1999, Intel seems to have lost all sense of reason and no longer allows engineering to dictate product design. Under pressure from stockholders to prop up its sagging stock price, and under pressure from PC users to deliver a chip faster than the AMD Athlon, Intel made two serious back-to-back mistakes in 2000 trying to rush chips out the door:

- it announced a 1.13 GHz Pentium III that proved defective, was dead on arrival, and had to be pulled from the market; and

- it rushed out the Pentium 4 with so much of the original design stripped away that the chip delivers neither the speed nor the features it was supposed to.

Consider that Intel stock was over $70 a share just 4 months ago, prior to these two mishaps. On the last day of trading in 2000, Intel stock dipped below $30, having failed to beat AMD all year. Yes, the whole industry is down this year. Dell is down. Gateway is down. Microsoft is down. But consider how much damage Intel did to its own credibility, and to the credibility of the whole market, by launching one dud after another. Not to mention its confused, three-pronged battle against AMD - not quite sure whether to keep pushing the Pentium III, go forward with the Pentium 4, or switch all efforts to the new IA64 Itanium architecture. Intel's engineers are spread three ways right now.

What it boils down to is this - just like at Microsoft and just like at Apple, the marketing scumbags at Intel have prevailed and pushed sound engineering aside. With the 1.13 GHz Pentium III chip dead on arrival, and the Pentium 4 crippled beyond easy repair, Intel may have just set itself back a good 3 to 5 years. Don't get me wrong, I've liked Intel's processors for years. I rode their stock up when their engineers were allowed to innovate. After all, they invented the processor that powers the PC. For almost a decade the 486 and Pentium architectures were superior to any competitor's efforts - better and faster than the AMD K5 and K6 chips, far more backward compatible than the Motorola 68K and PowerPC chips, and almost as fast or faster than the previous generation of chips they replaced. But, as history shows, it takes an Intel or an AMD or a Motorola a good 3 to 5 years to design a new processor architecture. And when you blow it, you blow it. You sit in second place for those next 3 to 5 years. The Pentium III has no future. The Pentium 4 needs to be redesigned. The Itanium is still not ready and will require all-new operating systems, development tools, and application software.

What users get today, buying the 1.4 or 1.5 GHz systems from DELL or Gateway or whoever, is an over-priced, under-engineered computer. A basic 1.5 GHz Pentium 4 computer runs well over $3000, while comparable Athlon and Pentium III based systems literally cost 1/3 to 1/2 as much. Given the price the PC manufacturers pay Intel for the Pentium 4 chip (a few hundred dollars more than the Pentium III), and given the $1000 to $2000 premium consumers pay for Pentium 4 systems, the only ones who benefit from the Pentium 4 are the PC manufacturers themselves! That is, if people are stupid enough to fall for it.

The Pentium 4 fails miserably on all counts. In terms of speed and running existing Windows code, the Pentium 4 is as slow or slower than existing Pentium III and AMD Athlon processors. In terms of price, an entry level Pentium 4 system from DELL or Gateway sells for about double the cost of a similar Pentium III or AMD Athlon based system, with little or no benefit to the consumer. And most sadly of all, from the engineering viewpoint, the Pentium 4 design is very disappointing and casts serious doubts on whether any intelligent life exists in Intel's engineering department. After a month of using them, I was so disgusted with the two Pentium 4 machines I purchased in November that both machines have since been returned to DELL and Gateway. I personally own dozens of PCs and hundreds of PC peripherals, and never have I been so disgusted with a product (and the way it is marketed) as to return it.

Both DELL and Gateway falsely advertise their Pentium 4 based systems as somehow being superior or better than their Pentium III and/or Athlon based systems. The only thing that is superior is the price. I urge all computer consumers to BOYCOTT THE PENTIUM 4 and BOYCOTT ALL INTEL PRODUCTS until such time as Intel redesigns their chips to work as advertised. If you have already purchased a Pentium 4 system and sadly found out that it doesn't work as fast as expected, RETURN IT IMMEDIATELY for a refund.

In hindsight, it is not surprising then that prior to the November 20th launch of the Pentium 4, Intel delayed the chip numerous times, and Intel, DELL, Gateway, and COMPAQ all warned of potential earnings problems in the coming quarter, probably knowing full well of the defects in the Pentium 4. Remember, the engineers at those companies have had Pentium 4 chips to play with for several months prior to launch.

It is also not surprising that a week before the Pentium 4 launch, at COMDEX Las Vegas the week of November 13th 2000, neither Intel nor Gateway, who both had huge displays at that show, would give much information about the Pentium 4 systems. Not price. Not speed. Not specs. While Intel did display Pentium 4 based computers, they were locked up during show hours and not available for inspection by the general public. At Gateway's booth, many of the salespeople appeared ignorant, apparently not even aware that the Pentium 4 was being launched the following week. Even DELL, usually a big exhibitor at these shows (hell, they were half the show at the Windows 2000 launch in February), chose to pull their show exhibit completely, holding only closed door private sessions with the press. Shareholders and software developers and the public were barred from these secret meetings. Why?

Don't be a sucker. Don't buy a Pentium 4 based computer. Do as we have suggested here at Emulators for over a year. If you need a fast inexpensive PC, buy one that uses an inexpensive Intel Celeron processor if you must. If you require maximum speed, buy one based on the AMD Athlon. Under no circumstances should you purchase a Pentium II, Pentium III, or Pentium 4 based computer! In fact, with the cheaper AMD Duron now available to rival the Celeron, it makes more sense to boycott Intel completely. Buy AMD based systems. AMD has worked hard to outperform Intel and they deserve your business!


How Intel Blew It

Before I start the next section and get very technical, I'll explain briefly how over the past 5 years Intel dug itself into the hole it is in now. When you understand the trouble Intel is in, their erratic behavior will make a little more sense.

Let's go back 2 or 3 years, to when the basic Pentium and Pentium MMX chips were battling AMD's K6 chips. AMD knew (I'm sure) that it had inferior chips on its hands with the K6 line. With the goal of producing a chip truly superior to anything from Intel, its engineers went back to the drawing board and designed a chip architecture from scratch, codenamed the "K7", which in late 1999 was released as the AMD Athlon processor. It took 5 years of work, but they hit their goal. Faster than the best Pentium III of the time, the Athlon delivered 20% to 50% more speed at only slightly higher cost than the basic Pentium III. Mission accomplished!

Intel on the other hand, not content with 90% market share, focused not on FASTER chips, but on CHEAPER, SLOWER chips. Monopolistic actions, much like Microsoft's, designed not to deliver a better product to the consumer but rather to wipe out the competition. The Pentium II, while easily the fastest chip on the market at the time, was also more expensive than the AMD K6 and Intel's own Pentium chips. And thus started a comical series of brain dead marketing blunders:

- first Intel shipped the original Celeron, a Pentium II stripped of its L2 cache, a chip so slow it was laughed out of the market;

- then Intel fixed the Celeron by putting 128K of full speed L2 cache right on the chip, producing a budget chip that matched its own far more expensive Pentium II;

- and with the Pentium II now pointless, Intel re-badged the same core and sold it as the "new" Pentium III.

In other words, Intel succeeded so well at producing a low cost version of the Pentium II, that it not only put the AMD K6 to shame, it also killed off the Pentium II and was forced to fraudulently remarket the chip as the Pentium III! For all intents and purposes, the Pentium II, the Celeron, and the Pentium III are ONE AND THE SAME CHIP. They're based on the same P6 architecture, with only things like clock speed and cache size to differentiate the chips. This is why we tell you not to purchase a Pentium II or Pentium III based system. If you must buy Intel, buy a Celeron. Same chip, lower cost.

Sure, sure, the Pentium III has new and innovative features, like, oooh, a unique serial number on each chip. Well guess what? The serial number idea was so poorly received, and rightfully so, that the serial number is already dead. The Pentium 4 has no such feature. The new MMX instructions, renamed SSE to sound more important, are still not supported by most compilers.

What Intel FAILED TO DO during these past 5 years is anticipate that the end of the line for the P6 architecture would come as quickly as it did. The P6 hits an upper limit around 1 GHz and cannot compete with faster AMD chips, which people already have running over-clocked in the 1.5 GHz range.

Here is how and why Intel REALLY blew it. Intel has known since the Athlon first came out in 1999 that its P6 architecture was doomed. Intel was already well under way in developing the Pentium 4. Remember, these chips take 3 to 5 years to design and implement, and it had already been 3 years since the P6 architecture was launched. Intel had about two more years of work left, but that meant losing badly to the Athlon for those two years.

So instead of focusing on engineering - doing what AMD did and biting the bullet while it developed the new chip - Intel first tried to ship a faster Pentium III chip. That backfired. So as a last resort it pulled another Celeron-type stunt and shipped a crippled Pentium 4 that cut so many features as to be neither fast nor cheap, benefiting no one but greedy computer makers.

I've been studying Intel's publicly available white papers on the Pentium 4 for the better part of 6 months now, and while the chip looked promising on paper, the first actual release of the chip is at best a castrated version of the ideal chip Intel set out to design. Intel selectively left out important implementation details of the Pentium 4, which it finally revealed in November with the posting of the Intel Pentium 4 Processor Optimization manual on its web site.

In an attempt to cover up the design defects, and with no backup plan in place (since the demise of the 1+ GHz Pentium III chip), Intel has been forced to word its optimization document very carefully. I encourage all software developers and technically literate computer users to download the Pentium 4 optimization manual mentioned above and, for comparison, to also download and study the Pentium III manuals as well as the AMD Athlon manual. It does not take a rocket scientist to read and compare the three sets of documents and realize what the design flaws in the Pentium 4 are.

This is not a simple Pentium floating point bug that can be fixed by replacing the processor. This is not a 486SX scam where Intel was selling crippled 486DX chips as SX chips and then selling you a second processor (a real 486DX) as an upgrade. No, in both those past cases the defective chip still delivered the true speed performance advertised. One was simply the result of a minor design error while the other was a marketing scam, but in the end, the chips lived up to spec. And both chips could be replaced with working chips.

In the case of the Pentium 4, the chip contains design flaws which aren't easily fixed, and it is marketed fraudulently since the speed claims are pulled out of thin air. No quick upgrade or chip fix exists to deliver the true performance that the Pentium 4 was supposed to have. Users will have to wait another year or two while Intel cranks out new silicon which truly implements the full Pentium 4 spec and fixes some of the glaring flaws of the Pentium 4 design.

If you do not have a good technical background on Pentium processors, I recommend you read my Processor Basics section. It will give you a good outline of the history of PC processors over the past 20 years and will allow you to read and understand most of the Intel and AMD processor documents. You have to have at least a basic understanding of the concepts in order to understand why the Pentium 4 is the disaster that it is.

Or if you're a geek like me, skip right ahead to the Pentium 4 - Generation 7 section.


Processor Basics - the various generations of processors over the past 20 years

Generation 1 - 8086 and 68000

In the beginning, the computer dark ages of two decades ago, there was the 8086 chip, Intel's first 16-bit processor, which delivered 8 16-bit registers and could manipulate 16 bits of data at a time. It could also address only 16 bits of address space at a time (or 64K, much like the Atari 800 and Apple II of the same time period). Using a trick known as segment registers, a program could simultaneously address 4 such 64K segments and have a total of 1 megabyte of addressable memory in the computer. Thus was born the famous 640K RAM limitation of DOS, since the remaining 384K was reserved for hardware and video.
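To make the segment trick concrete, here is the address arithmetic worked out in C. This is a sketch of my own for illustration (the 8086 itself does this in hardware); the sample segment value is simply the familiar text mode video segment.

#include <stdio.h>

/* The 8086 forms a 20-bit physical address as (segment * 16) + offset,
   stretching 16-bit registers into 1 megabyte of addressable memory. */
int main(void)
{
    unsigned int  segment  = 0xB800;   /* classic text mode video segment */
    unsigned int  offset   = 0x0000;
    unsigned long physical = ((unsigned long)segment << 4) + offset;

    /* prints: B800:0000 -> physical address B8000 */
    printf("%04X:%04X -> physical address %05lX\n", segment, offset, physical);
    return 0;
}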

A lower cost and slower variant, the 8088, was used in early PCs, providing only an 8-bit bus externally to limit the number of pins on the chip and reduce costs. As I incorrectly stated here before, the 8086 was not used in the original IBM PC. It was actually the lower cost 8088.

The original Motorola 68000 chip, while containing 16 32-bit registers and being essentially a 32-bit processor, used a similar trick of having only 16 external data pins and 24 external address pins to reduce the pin count on the chip. An even smaller 68008 chip addressed only 20 bits of address space externally and had the same 1 megabyte memory limitation as the 8086.

While these first generation processors from Intel and Motorola ran at speeds of 4 to 8 MHz, they each required multiple clock cycles to execute any given machine language instruction, because they lacked the modern features we know today such as caches and pipelines. A typical instruction took 4 to 8 cycles to execute, really giving the chips an equivalent speed of about 1 MIPS (i.e. 1 million instructions per second).

Generation 2 - 80286 and 68020

By 1984, Intel released the 80286 chip used in the IBM AT and clones. The 80286 introduced the concept of protect mode, a way of protecting memory so that multiple programs could run at the same time and not step on each other. This was the base chip that OS/2 was designed for and which was also used by Windows/286. The 286 ran at 8 to 16 MHz, offering over double the speed of the original 8086 and could address 16 megabytes of memory.

Motorola meanwhile developed the 68020, the true 32-bit version of the 68000, with a full 32-bit data bus and 32-bit address bus capable of addressing 4 gigabytes of memory.

By the way, both companies did release a "1" version of each processor - the 80186 and 68010 - but these were minor enhancements over the 8086 and 68000 and not widely used in home computers.

Generation 3 - 80386 and 68030

The world of home computers didn't really become interesting until late 1986, when Intel released its 3rd generation chip - the 80386, or simply the 386. This chip, although almost 15 years old now, is the base on which OS/2 2.0, Windows 95, and the original Windows NT run. It was Intel's first true 32-bit x86 chip, extending the registers to a full 32 bits in size and increasing addressable memory to 4 gigabytes. In effect it caught up to the 68020 in a big way, also adding things like paging (which is the basis of virtual memory) and support for true multi-tasking and mode switching between 16-bit and 32-bit modes.

The 386 is really the chip, I feel, that put Intel in the lead over Motorola for good. It opened the door to things like OS/2 and Windows NT and Linux - truly pre-emptive, multi-tasking, memory protected operating systems. It was a 286 on steroids, so much more powerful, so much faster, so much more capable than the 286, that at over $20,000 a machine, people were dying to get their hands on them. I remember reading the review of the first Compaq 386 machine, again, a $20,000+ machine that today you can buy for $50, and the reviewer would basically kill to get one.

What made the 386 so special? Well, Intel did a number of things right. First they made the chip more orthogonal. What that means is that they extended the machine language instructions so that in 32-bit mode, almost any of the 8 32-bit registers could be used for anything - storing data, addressing memory, or performing arithmetic operations. Compare this to the 8086 and 80286, whose 16-bit instructions could only use certain registers for certain operations. The orthogonality of the 386 registers made up for the extra registers in the Motorola chips, which specifically had 8 registers for data and 8 for addressing memory. While you could use an address register to hold data or a data register to address memory, it was more costly in terms of clock cycles.
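Here is a small sketch of that orthogonality in practice, written as MSVC inline assembly (the function and its names are mine, for illustration only). In 32-bit mode any general register can serve as a base or a scaled index, where the 16-bit chips allowed only a few fixed combinations like [BX+SI]:

/* Hypothetical illustration: fetch table[i] using arbitrary registers
   as base and index - addressing forms the 8086 and 286 never had. */
unsigned int fetch_element(unsigned int *table, unsigned int i)
{
    unsigned int value;
    __asm {
        mov  ecx, table          /* any register may hold the base...    */
        mov  edx, i              /* ...and any register may be the index */
        mov  eax, [ecx + edx*4]  /* scaled index addressing, new in 32-bit mode */
        mov  value, eax
    }
    return value;
}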

The 386 allowed the average programmer to do away with segment registers and 640K limitations. In 386 protect mode, which is what most Windows, OS/2, and Linux programs run in today, a program has the freedom to address up to 4 gigabytes of memory. Even when such memory is not present, the chip's paging feature allows the OS to implement virtual memory by swapping memory to hard disk, what most people know as the swap file.

Another innovation of the 386 chip was the code cache, the ability of the chip to buffer up to 256 bytes of code on the chip itself and eliminate costly memory reads. This is especially useful in tight loops that are smaller than 256 bytes of code.

Motorola countered with the 68030 chip, a similar chip which added built-in paging and virtual memory support, memory protection, and a 256 byte code cache. The 68030 also added a pipeline, a way of executing parts of multiple instructions at the same time, to overlap instructions, in order to speed up execution.

Both the 386 and 68030 ran at speeds ranging from 16 MHz to well above 40 MHz, easily bringing the speed of the chips to over 10 MIPS. Both chips still required multiple clock cycles to execute even the simplest machine language instructions, but were still an order of magnitude faster than their first generation counterparts. Microsoft quickly developed Windows/386 (and later OS/2 2.0 and Windows NT) for the 386, and Apple added virtual memory support to the Mac OS.

Both chips also introduced something known as a barrel shifter, a circuit which can shift or rotate any 32-bit number in one clock cycle - something used often by many different machine language instructions.

The 386 chip is famous for unseating IBM as the leading PC developer and for causing the breakup with Microsoft. IBM looked at the 386, decided it was too powerful for the average user, and chose not to use it in PCs and not to write operating systems for it. Instead it kept using the 286 and developed OS/2 for the 286. Microsoft on the other hand developed Windows/386 with improved multitasking, Compaq and other clone makers used the 386 to deliver the horsepower needed to run such a graphical operating system, and the rest is history. By the time IBM woke up, it was too late. Microsoft won. Compaq, DELL, and Gateway won.

Generation 4 - 486 and 68040

This generation is famous for integrating the floating point co-processor, previously a separate external chip, into the main processor. It also refined the existing technology to run faster. The pipelines on the Intel 486 and Motorola 68040 were improved to give, in effect, the appearance of one clock cycle per instruction. 20 MIPS. 25 MIPS. 33 MIPS. Double or triple the speed of the previous generation with virtually no change in instruction set! As far as the typical programmer or computer user is concerned, the 386 and 486, or the 68030 and 68040, were the same chips, except that the 4th generation ran quicker than the 3rd. And speed was the selling point and the main reason you upgraded to these chips.

These chips gained speed in a number of ways. First, the caches were increased in size to 8K, and made to handle both code and data. Suddenly relatively large amounts of data (several thousand bytes) could be manipulated without incurring the costly penalty of accessing main memory. Great for mathematical calculations and other such applications. This is why many operating systems today and many video games don't support anything prior to the 4th generation. Mac OS 8 and many Macintosh games require a 68040. Windows 98, Windows NT 4.0, and most Windows software today require at least a 486. The caches made that huge a difference in speed! Remember this for later!

With the ability to read memory in a single clock cycle now came the ability to execute instructions in a single clock cycle. By decoding one instruction while finishing the execution of the previous instruction, both the 486 and 68040 could give the appearance of executing 1 instruction per cycle. Any given instruction still takes multiple clock cycles to execute, but by overlapping several instructions at once at different stages of execution, you get the appearance of one instruction per cycle. This is the job of the pipeline.

Keeping the pipeline full is of extreme importance! If you have to stop and wait for memory (i.e. the data or code being executed isn't in the cache) or you execute a complex instruction such as a square root, you introduce a bubble into the pipeline - an empty step where no useful work is being done. This is also known as a stall. Stalls are bad. Remember that.

One of the great skills of writing assembly language code, or writing a compiler, is knowing how to arrange the machine language instructions in such an order so that the steps you ask the processor to perform are done as efficiently as possible.

The rules for optimizing code on the 486 and 68040 are fairly simple:

- use simple instructions that execute in a single clock cycle, and avoid complex instructions such as multiply and divide where a simpler sequence will do;

- keep frequently used code and data small enough to stay in the on-chip cache;

- keep the pipeline full by avoiding sequences that force the processor to stop and wait.

The techniques used in the 4th generation are very similar to techniques used by RISC (reduced instruction set) processors. The concept is to use the simplest instructions possible. Use several simple instructions in place of one complex instruction. For example, to multiply by 2, simply add a value to itself instead of forcing the chip to use its multiply circuitry. Multiply and divide take many clock cycles, which is fine when multiplying by a large number. But if you simply need to double a number, it is faster to tell the chip to add two numbers than to multiply two numbers.
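A trivial example of this rule in C, just to make the idea concrete (any decent compiler performs this substitution for you; the point is what the chip prefers to execute):

/* Doubling a number: same result, very different cost on a 486/68040. */
unsigned int double_with_multiply(unsigned int x)
{
    return x * 2;    /* invokes the multi-cycle multiply circuitry */
}

unsigned int double_with_add(unsigned int x)
{
    return x + x;    /* one simple single-cycle add */
}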

Another reason to follow the optimization rules is that both the 486 and 68040 introduced the concept of clock doubling - or in general, using a clock multiplier to run the processor internally at several times the speed of the main computer clock. The computer may run at, say, 33 MHz - the bus speed - but a typical 486 or 68040 chip is actually running at 66 MHz internally and delivering a whopping 66 MIPS of speed.

The year is now 1990. Windows 3.0 and Macintosh System 7 are about to be released.

Generation 5 - the Pentium and PowerPC

With the first decade and the first 4 generations of chips now in the bag, both Motorola and Intel looked for new ways to squeeze speed out of their chips. Brick walls were being hit. For one, memory chips weren't keeping up with the rapidly increasing speed of processors. Even today, most memory chips are barely 10 or 20 times faster than the memory chips used in computers two decades ago, yet processor speeds are up by a factor of a thousand!

Worse, the remaining hardware in the PC - things like video cards, sound cards, hard disks, and modems - runs at fixed clock speeds of 8 MHz or 33 MHz or some sub-multiple of the bus speed. Basically, any time the processor has to reference external memory or hardware, it stalls. The faster the clock multiplier, the more instructions execute each bus cycle, and the higher the chances of a stall.

This is why, for example, upgrading from a 33 MHz 486 to a 66 MHz 486 generally offers only about a 50% speed increase, and similarly for upgrading from the 68030 to the clock doubled 68040.

It's been said many times by many people, but by now you should have realized that CLOCK SPEED IS NOT EVERYTHING!!

What can affect speed far more than mere clock speed is the rate at which the chip can process instructions. The 4th generation brought the chip down to one instruction per clock cycle. The 5th generation developed the concept of superscalar execution. That is, executing more than one instruction per clock cycle by executing instructions in parallel.

Intel and Motorola chose different paths to achieve this. After an aborted 68050 chip and the short lived 68060 chip, Motorola abandoned its 68K line of processors and designed a new chip based on IBM's POWER RISC chip. A RISC (Reduced Instruction Set) processor does away with complicated machine language instructions which can take multiple clock cycles to execute, and replaces them with simpler instructions which execute in fewer cycles. The advantage is that the chip achieves a higher throughput in terms of instructions per second or instructions per clock cycle; the down side is that it usually takes more instructions to do the same work as on a CISC (Complex Instruction Set) processor.

The theory with RISC processors, which has long since proven to be bullshit, was that by making the instructions simpler the chip could be clocked at a higher clock speed. But this in turn only made up for the fact that more instructions were now required to implement any particular algorithm, and worse, the code grew bigger and thus used up more memory. In reality a RISC processor is no more or less powerful than a CISC processor.

Intel engineers realized this and continued the x86 product line by introducing the Pentium chip, a superscalar version of the 486. The original Pentium was for all intents and purposes a faster 486, executing up to 2 instructions per clock cycle, compared to the 1 instruction per cycle limit of the 486. Once again, CLOCK SPEED IS NOT EVERYTHING.

By executing multiple instructions at the same time, the design of the processor gets more complicated. No longer is it a serial operation. While earlier processors essentially followed this process:

- fetch an instruction
- decode it
- execute it
- write back the results, then repeat

a superscalar processor now has additional steps to worry about:

- fetch several instructions at once
- decode two or more instructions per clock cycle
- check that the instructions do not depend on each other's results
- execute them in parallel and retire the results in the correct order

The extra checks are necessary to make sure that the code executes in the correct order. If two ADD operations follow one another, and the second ADD depends on the result of the first, the two ADD operations cannot execute in parallel. They must execute in serial order.

Intel gave special names to the two "pipes" that instructions execute in: the U pipe and the V pipe. The U pipe is the main path of execution. The V pipe executes "paired" instructions - that is, the second instruction sent from the decoder, when it is determined not to conflict with the first instruction.
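Here is a minimal sketch of the pairing rule, again as MSVC inline assembly with made-up values. The first pair of ADDs is dependent and must run serially; the second pair is independent and can issue into the U and V pipes in the same clock cycle:

/* Hypothetical illustration of Pentium U-V pairing. */
void pairing_demo(void)
{
    __asm {
        /* serial: the second ADD needs the result of the first */
        mov  eax, 1
        add  eax, 2
        add  eax, 5      /* depends on eax: cannot pair, U pipe only */

        /* parallel: two independent ADDs can pair */
        mov  ecx, 3
        mov  edx, 4
        add  ecx, 7      /* U pipe */
        add  edx, 9      /* V pipe, same clock cycle */
    }
}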

Since the concept of superscalar execution was new to most programmers, and to Microsoft's compilers, the original Pentium chip only delivered about 20% faster speed than a 486 at the same speed. Not 100% faster speed as expected. But faster nevertheless. The problem was very simply that most code was written serially.

Code written today on the other hand does execute much faster, since compilers now generate code that "schedules" instructions correctly. That is, it interleaves pairs of mutually exclusive instructions so that most of the time two instructions execute each clock cycle.

The original PowerPC 601 chip similarly had the ability to execute two instructions per cycle - an arithmetic instruction paired with a branch instruction. The PowerPC 603 and later versions of the PowerPC added additional arithmetic execution units in order to execute 2 math instructions per cycle.

With the ability to execute twice as much code as before comes greater demand on memory. Twice as many instructions need to be fed into the processor, and potentially twice as much data memory is processed.

Intel and Motorola found that as clock speed was being increased in the processors, performance didn't scale, even on older chips. A 66 MHz 486 only delivered 50% more speed than a 33 MHz 486. Why?

The reason again has to do with memory speed. When you double the speed of a processor, the speed of main memory stays the same. That means that a cache miss, which forces the processor to read main memory, now takes TWICE the number of clock cycles. With today's fast processors, a memory read can literally take 100 or more clock cycles. That means 100, or worse, 200 instructions' worth of work not getting done.

The way Intel and Motorola attacked this problem was to increase the size of the L1 cache, the very high speed on-chip level one cache. For example, the original 486 had an 8K cache. The newer 100 MHz 486 chips had a 16K cache.

But 8K or 16K is nothing compared to the megabytes that a processor can suck in every second. So computers started to include a second level cache, the L2 cache, which was made up of slightly slower but larger memory. Typically 256K. The L2 cache is still on the order of 10 times faster than main memory, and allows most code to operate at near to full speed.

When the L2 cache is disabled (which most PC users can do in the BIOS setup), or when it is left out completely, as Apple did in the original Power Macintosh 6100, performance suffers.
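The effect of the caches is easy to demonstrate. The sketch below (my own illustration; the timing harness is omitted) sums the same 4 million integers twice. The sequential loop walks memory in cache-line order and runs near full speed; the strided loop jumps 16K between touches, defeats the L1 and L2 caches, and spends most of its time waiting on main memory:

#define NUM_INTS (4 * 1024 * 1024)   /* 16 megabytes of data: far larger than any cache */

/* Cache-friendly: consecutive reads hit the L1/L2 caches. */
long sum_sequential(const int *data)
{
    long sum = 0;
    int i;
    for (i = 0; i < NUM_INTS; i++)
        sum += data[i];
    return sum;
}

/* Cache-hostile: same work, but nearly every read is a cache miss. */
long sum_strided(const int *data)
{
    long sum = 0;
    int i, j;
    for (j = 0; j < 4096; j++)
        for (i = j; i < NUM_INTS; i += 4096)   /* 16K jumps between reads */
            sum += data[i];
    return sum;
}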

Generation 6 - the P6 architecture and PowerPC G3/G4

By 1996, as processor speeds hit 200 MHz, more brick walls were being hit. Programmers simply weren't optimizing their code, and as processor speeds increased, the processors simply spent more time waiting on memory or waiting for instructions to finish executing. Intel and Motorola adopted a whole new set of tricks in their 6th generation of processors. Tricks such as "register renaming", "out of order execution", and "branch prediction".

In other words, if the programmer won't fix the code, the chip will do it for him. The Intel P6 architecture, first released in 1996 in the Pentium Pro processor, is at the heart of all of Intel's current processors - the Pentium II, the Celeron, and the Pentium III. Even AMD's Athlon processor uses the same tricks.

What they did is as follows:

- register renaming: the 8 x86 registers are mapped onto a much larger set of internal registers, so that independent uses of the same architectural register no longer block each other;

- out of order execution: decoded micro-ops wait in a pool and execute as soon as their operands are ready, rather than in strict program order, with results retired in the correct order;

- branch prediction and speculative execution: the chip guesses which way a branch will go and keeps right on executing down the predicted path rather than stalling;

- and plain brute force: three decoders, multiple execution units, larger caches, and ever higher clock speeds.

From an engineering standpoint, the enhancements in the 6th generation processors are truly amazing. Through the use of brute force (larger caches and faster clock speeds), parallel execution (multiple execution units and 3 decoders), and clever interlocking circuitry to allow out-of-order execution, Intel has been able to stick with the same basic architecture for 5 years now, catapulting CPU throughput from the 100 to 150 MHz range in 1995 to over 1 GHz today. Most code, even poorly written unoptimized code, executes at a throughput of over 1 instruction per clock cycle, or roughly 1000 MIPS on today's fastest Pentium III processors.

The PowerPC G3 and G4 chips use much the same tricks (after all, all these silicon engineers went to the same schools and read the same technical papers) which is why the G3 runs faster than a similarly clocked 603 or 604 chip.


Limitations of the Pentium III

AMD, calling the Athlon a "7th generation" processor - something I don't fully agree with, since AMD never really had a 6th generation processor - took the basic ideas behind the Pentium II/III and PowerPC G3 and used them to implement the Athlon. Having the benefit of seeing the original Pentium Pro's faults, AMD fixed many of the bottlenecks of the P6 design which even today limit the full speed of the Pentium III.

These are, of course, the same problems that Intel is trying to address in the Pentium 4. Understanding why the AMD Athlon is a faster chip, and what AMD did right, helps us understand why Intel needed to design the Pentium 4 in the first place, and that is what I shall discuss in this section.

Not counting the unbuffered segment register problem in the original Pentium Pro (which was fixed in the far more popular Pentium II chip), what are the bottlenecks? What can possibly slow down the processor when instructions are being executed out-of-order 3 at a time!?!?

Well, keep in mind that a chain is only as strong as its weakest link. In the case of the processor, each stage can be considered a link in a chain. The main memory. The L2 cache. The L1 cache. The decoder. The scheduler which takes decoded micro-ops and feeds them into the various execution units. Of these, the two main bottlenecks in the P6 architecture are the 4-1-1 limitation of the decoder and the dreaded partial register stall.

If you read the Pentium III optimization document, you will see reference to the 4-1-1 rule for decoding instructions. When the Pentium III (for example) fetches code, it pulls up to three instructions through the decoders each clock cycle. Decoder 1 can decode any machine language instruction. Decoders 2 and 3 can decode only simple, RISC-like instructions that break down into 1 micro-op. A micro-op is a basic operation performed inside the processor. For example, adding two registers takes one micro-op. Adding a memory location to a register requires two micro-ops: a load from memory, then an add. It uses two execution units inside the processor: the load/store unit on one clock cycle, and then an ALU on the next clock cycle. Micro-ops translate roughly into clock cycles per instruction, but don't think of it that way. Since several instructions are being executed in parallel and out of order, the concept of clock cycles per instruction becomes rather fuzzy.

Instead, think of it like this. What is the limitation of each link? How frequently does that link get hit? Main memory, for example, may not be accessed for thousands of clock cycles at a time. So while accessing main memory may cost 100 clock cycles, that penalty is taken infrequently thanks to the buffering performed by the L1 and L2 caches. Only when dealing with large amounts of memory at a time, such as when processing a multi-megabyte bitmap, does it start to hurt.

Intel and AMD have addressed this problem in two ways. First, over the years they have gradually increased the speed of the "front side bus" - the data path between main memory and the processor - from 66 MHz in the Celeron and Pentium II, to 100 and 133 MHz in the Pentium III, to 200 MHz in the AMD Athlon. Second, Intel produces a version of the Pentium II and III called the "Xeon", which contains up to 2 megabytes of L2 cache. The Xeon is used frequently in servers as it supports 8-way multi-processing, and even on the desktop the Xeon offers considerable speed advantages over the standard Pentium III when large amounts of data are involved. The PowerPC G4 has up to 1 megabyte of L2 cache, which explains why a Power Mac G4 at a slower clock speed blows away a Pentium III in applications such as Photoshop.

Basically, the larger the working set of an application, that is, the amount of code and data in use at any given time, the larger the L2 cache needs to be. To keep costs low, Intel and AMD have both actually DECREASED the sizes of their L2 caches in newer versions of the Pentium III and Athlon, which I believe is a mistake.

The top level cache, the L1 cache, is the most crucial, since it is accessed first for any memory operation. The L1 cache uses extremely high speed memory (which has to keep up with the internal speed of the processor), so it is very expensive to put on chip and tends to be relatively small. Again, from 8K in the 486 to 128K in the Athlon. But as my tests have shown, the larger the L1 cache, the better.

The next step is the decoder, and this is one of the two major flaws of the P6 family. The 4-1-1 rule prevents more than one "complex" instruction from being decoded each clock cycle. Much like the U-V pairing rules for the original Pentium, Intel's documents contain tables showing how many micro-ops are required by every machine language instruction, and they give guidelines on how to group instructions.
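As an illustration, here is how a small summing loop might be arranged to respect the 4-1-1 rule, written as MSVC inline assembly (the routine is hypothetical, not from any shipping product): lead each group with the one instruction that decodes to multiple micro-ops, and follow it with single micro-op instructions so decoders 2 and 3 stay busy.

/* Sum 'count' integers (count assumed > 0), grouped for the 4-1-1 decoders. */
int sum_with_decode_groups(const int *p, int count)
{
    int result;
    __asm {
        mov  esi, p
        mov  ecx, count
        xor  eax, eax            /* running sum */
        xor  edx, edx            /* element counter */
    next:
        add  eax, [esi]          /* 2 micro-ops (load + add): decoder 1 */
        add  esi, 4              /* 1 micro-op: decoder 2 */
        add  edx, 1              /* 1 micro-op: decoder 3 */
        cmp  edx, ecx
        jl   next
        mov  result, eax
    }
    return result;
}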

Unlike main memory, the decoder is always in use. Every clock cycle, it decodes 1, 2, or 3 instructions of machine language code. This limits the throughput of the processor to at most 3 times the clock speed. For example, a 1 GHz Pentium III can execute at most 3 billion instructions per second, or 3000 MIPS. In reality, most programmers and most compilers write code that is less than optimal, and which is usually grouped for the complex-simple-complex-simple pairing rules of the original Pentium. As a result, the typical throughput of a P6 family processor is more like double the clock speed. For example, 2000 MIPS for a 1 GHz processor.

By sticking to simpler instruction forms and simpler instructions in general (which in turn decode to fewer micro-ops), a machine language programmer can achieve close to the 3x MIPS limit imposed by the decoder. In fact, this simple technique (along with elimination of partial register stalls) is the reason our SoftMac 2000 Macintosh emulator runs so much faster than other emulators, and why in the summer of 2000, when I re-wrote the FUSION PC emulator, I was able to increase its speed by about 50% in only a few days of work. By simply understanding how the decoder works and writing code appropriately, one can achieve near optimal speed from the processor.

Once again, let me repeat: CLOCK SPEED IS NOT EVERYTHING! So many people stupidly go out and buy a new computer every year expecting faster clock speed to solve their problems, when the main problem is not clock speed. The problem is poorly written code, uneducated programmers, and out of date compilers (that's YOU, Microsoft) that target obsolete processors. How many people still run Microsoft Office 95? OK, do a DUMPBIN on WINWORD.EXE or EXCEL.EXE to get the version number of the compiler tools. That product was written with an old version of Visual C++ which targets the now obsolete 486. Do the same thing with Office 97 or Office 2000: newer tools that target the P6. Wonder why Office 97 runs faster than Office 95 on the same Pentium III computer? Ditto for Windows 98 over Windows 95, and Windows 2000 over Windows 98. The newer the compiler tools, the better optimized the code is for today's processors.

The next bottleneck is the execution units themselves - the guts of the processor. They determine how many micro-ops of a given type can execute in one clock cycle. For example, the P6 family can load or store one memory location per clock cycle. It can execute one floating point instruction per clock cycle because there is only one FPU. This means that even the most optimized code - code that caches perfectly and decodes perfectly - can still hit a bottleneck simply because too many instructions of the same type are trying to execute. Again, one needs to mix instructions - integer, floating point, and branch - to make best use of the processor.

Finally, that dreaded partial register stall! The one serious bug in the P6 design that can cause legacy code to run slower. By "legacy code" I mean code written for a previous version of the processor. See, until now, every generation improved on the design of the previous generations. No matter what, you were almost 100% guaranteed that a newer processor, even running at the same clock speed as a previous processor, would deliver more speed. That's why a 68040 is faster than a 68030, and why a Pentium is faster than a 486.

Not so with generation 6. While every other optimization in the P6 family pretty much boosts performance without requiring the programmer to rewrite a single line of code - even the 4-1-1 decode rule - the register renaming optimization has one fatal flaw that kills performance: partial register stalls! A partial register stall occurs when a partial register (that is, the AL, AH, and AX parts of the EAX register, the BL, BH, and BX parts of the EBX register, etc.) gets renamed to a different internal register because the processor believes the uses are mutually exclusive.

For example, a C compiler will typically read an 8-bit or 16-bit integer from memory into the AL or AX register. It will then perform some operation on that integer, for example, incrementing it or testing a value. A typical C code sequence to test a byte for zero goes something like this:

int foo(unsigned char ch)
{
return (ch == 0) ? 1 : -1;
}

Microsoft's compilers have for years used a "clever" little trick with conditional expressions: use a compare instruction to set the carry flag based on the result of an expression, then use the x86 SBB instruction to set a register to all 1's or all 0's. Once set, the register can be masked and manipulated to generate any two desired resulting values. MMX code makes heavy use of this trick as well, although MMX registers are not subject to the partial register stall.
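Written out in plain C, the trick the compiler is playing looks like this (my own rendering of the listings below, not compiler output):

/* What CMP/SBB/AND/DEC compute for foo() above:
   CMP ch,1 sets the carry flag exactly when ch == 0 (unsigned ch < 1),
   and SBB eax,eax then yields 0 - 0 - carry: all ones or all zeros. */
int foo_branchless(unsigned char ch)
{
    unsigned int carry = (ch < 1);     /* CMP BYTE PTR ch, 1   */
    unsigned int mask  = 0 - carry;    /* SBB eax, eax         */
    return (int)((mask & 2) - 1);      /* AND eax, 2 / DEC eax */
}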

Anyway, when you compile the above code using Microsoft's Visual C++ 4.2 compiler with full Pentium optimizations (-O2 -G5), you get the following code:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 80 7c 24 04 01 cmp BYTE PTR _ch$[esp-4], 1
00005 1b c0 sbb eax, eax
00007 83 e0 02 and eax, 2
0000a 48 dec eax

0000b c3 ret 0
_foo ENDP
_TEXT ENDS
END

and when compiled with Microsoft's latest Visual C++ 6.0 SP4 compiler you get code like this:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 44 24 04 mov al, BYTE PTR _ch$[esp-4]
00004 f6 d8 neg al
00006 1b c0 sbb eax, eax
00008 83 e0 fe and eax, -2 ; fffffffeH
0000b 40 inc eax

0000c c3 ret 0
_foo ENDP
_TEXT ENDS
END

Notice in both cases the use of the SBB instruction to set EAX to either $FFFFFFFF or $00000000. Internally the processor reads the EAX register, subtracts it from itself, then writes the value back to EAX. (Yes, it is stupid that a processor subtracting a register from itself would read the register first, but I have verified that it does.) In the VC 4.2 case, the processor may or may not stall, because we don't know how far back the EAX register was last updated and whether all or part of it was updated.

But interestingly, with the latest 6.0 compiler, even using the -G6 (optimize for P6 family) flag, a partial register stall results. AL is written to, then all of EAX is used by the SBB instruction. This is perfectly valid code, and runs perfectly fine on the 486, Pentium classic, and AMD processors, but suffers a partial register stall on any of the P6 processors: a stall of about 12 clock cycles on the Pentium Pro, and about 4 clock cycles on the newer Pentium III.

Why does the partial register stall occur? Because internally the AL register and the EAX registers get mapped to two different internal registers. The processor does not discover the mistake until the second micro-op is about to execute, at which point it needs to stop and re-execute the instruction properly. This results in the pipeline being flushed and the processor having to decode the instructions a second time.

How do you solve the problem? Well, Intel DID tell developers how to avoid it. Most didn't listen. The way you work around a partial register stall is to clear the full register first, either by XORing it with itself, SUBtracting it from itself, or moving the value 0 into it. (Ironically, SBB, which is almost identical to SUB, does not do the trick!) Any of these three tricks flags the register as being clear, i.e. zero, which allows the subsequent uses to be mapped to the same internal register. No stall.

So what is the correct code? Something like this is correct (generated with the Visual C++ 7.0 beta):

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 4c 24 04 mov cl, BYTE PTR _ch$[esp-4]
00004 33 c0 xor eax, eax
00006 84 c9 test cl, cl
00008 0f 94 c0 sete al
0000b 8d 44 00 ff lea eax, DWORD PTR [eax+eax-1]

0000f c3 ret 0
_foo ENDP
_TEXT ENDS
END

Until every single Windows application out there gets re-compiled with Visual C++ 7.0, or gets hand coded in assembly language, your brand spanking new Pentium III processor will not run as fast as it can. And even then it comes at the expense of code size and larger memory usage. Note the extra XOR instruction needed to prevent the partial register stall on the SETE instruction. While this does eliminate the partial register stall, it does so at the expense of larger code. Eliminate one bottleneck and you end up increasing another.

Why the AMD Athlon doesn't suck

Guess what folks? The AMD Athlon has no partial register stall!!!! Woo hoo! AMD's engineers attacked the problem and eliminated it. I've verified that to be true by checking several different releases of the Athlon. That simple design fix, which affects just about every single Windows program ever written, along with the larger L1 cache and better parallel execution, is why the AMD Athlon runs circles around the Pentium III.

Floating point code especially - which means many 3-D games - runs faster on the Athlon because the code hits fewer bottlenecks inside the processor.

That's it, simple things that AMD did right:

- no partial register stall, so existing Windows code runs at full speed without a rewrite;

- a much larger L1 cache (128K) than any Pentium III;

- more parallel execution, including the ability to execute 3 floating point instructions per clock cycle;

- relaxed decoding rules, free of the P6's 4-1-1 grouping restrictions.

Intel? Are you listening? HELLO?


Pentium 4 - Generation 7 or complete stupidity?

Let's get to the meat of it. WHY THE PENTIUM 4 SUCKS. If you've read this far, I expect you have downloaded the Intel and AMD manuals I mentioned above, you've read them, and you have a good understanding of how the Pentium III, AMD Athlon, and Pentium 4 work internally. If not, start over!

You've read my previous section on the cool tricks introduced in the 6th generation processors (Pentium II, AMD Athlon, PowerPC G3) and the kinds of bottlenecks that can slow down the code:

- cache misses that force slow reads from main memory;

- the decoder and its 4-1-1 instruction grouping limitation;

- contention for the execution units;

- and the dreaded partial register stall.

As I mentioned, AMD, to their well deserved credit, attacked all these problems head on in the Athlon: detecting and eliminating the partial register stall, relaxing the limitations on the decoder and instruction grouping, and making the L1 caches larger than ever.

So, after 5 years of deep thought, billions of dollars in R&D, months of delays, hype beyond belief, how did Intel respond?

In what can only be considered a monumental lapse in judgment, Intel went ahead and threw out the many tried and tested ideas implemented in both the PowerPC and AMD Athlon processor families and literally took a step back 5 years to the days of the 486.

It seems that Intel is taking an approach similar to that of its upcoming Itanium chip - that the chip should do less optimization work and the programmer should be responsible for more of it. An idea not unfamiliar to RISC chip programmers, but Intel went a little too far. They literally stripped the processor bare and tried to use brute force clock speed to make up for it!

Except the idea doesn't work. Benchmark after benchmark shows the 1.5 GHz Pentium 4 running slower than a 900 MHz Athlon, in some cases slower than a 533 MHz Celeron, and in rare cases even as slow as a 200 MHz Pentium.

Intel did throw a few new ideas into the Pentium 4. The two ALUs (arithmetic logic units), which perform adds and other simple integer operations, run at twice the clock speed of the processor. In other words, the Pentium 4 is in theory capable of executing 6 billion integer arithmetic operations per second. As I'll explain below, the true limit is actually much lower, and no better than what you can already get out of a Pentium III or Athlon.

Another new idea is the concept of a "trace cache", which is basically a code cache for decoded micro-ops. The Pentium 4 has no L1 code cache; rather than decode the instructions in a loop over and over again, it caches the output of the decoder - the raw micro-ops. This sounds like a good idea at first, but in reality it proves no better than simply having an 8K code cache, and it certainly falls short of the Athlon's 64K code cache.

The Benchmarks

As Tom's Hardware documented last month, the Pentium 4 lost miserably to the AMD Athlon at MPEG4 video encoding. Only after Intel engineers personally modified the code did the Pentium 4 suddenly win the benchmark. A side effect was that the Athlon's results improved considerably as well, indicating that the code was very poorly written in the first place.

However, this brings up the point again that Intel now expects software developers to completely rewrite their code in order to see performance gains on the Pentium 4. And we don't all have the luxury of an Intel engineer showing up on our doorstep to rewrite our code for us. With thousands of Windows applications out there, not to mention the growing number of Linux applications for the PC, and sadly out of date compiler tools, does Intel seriously expect millions of lines of code to be rewritten just for the Pentium 4?

I downloaded the modified MPEG4 FlasK encoder and ran my own 150 megabyte sample video file through it. As test machines I used 8 similarly configured Windows Millennium computers, whose processors are listed in the results table below, roughly sorted by cost.

The encoding times (in seconds) for the same sample piece of video were as follows:

Chip speed and type Elapsed time (seconds) Clock cycles (billions)
1.5 GHz Pentium 4 484 726
900 MHz AMD Athlon 544 490
670 MHz Pentium III 743 498
650 MHz Pentium III 757 492
533 MHz Celeron 858 457
600 MHz AMD Athlon 922 553
500 MHz Pentium III 946 473
600 MHz Crusoe 1369 821

The 1.5 GHz Pentium 4 won of course, but just barely, over a 900 MHz AMD Athlon costing about 1/3 the price and running at 60% of the clock speed. Worse, the Pentium 4 fails to even cut the processing time in half compared to the much slower clocked Pentium III and Celeron systems. The Pentium 4 is barely twice as fast at this benchmark as a 500 MHz Pentium III.

This benchmark illustrates several important concepts I discussed earlier, especially once we calculate the total number of clock cycles executed on each processor (elapsed time multiplied by clock speed - for example, 484 seconds at 1.5 GHz is about 726 billion cycles). Counting total cycles equalizes the differences in clock speed between the various systems.

First, CLOCK SPEED IS NOT EVERYTHING!! Just because one processor runs at a faster clock speed than another does not mean you will get proportionally faster performance.

Second, the Pentium 4 seems to require almost 50% more clock cycles than the Athlon or Pentium III, indicating either that floating point operations each take more cycles on the Pentium 4, that the Pentium 4 does not execute as many floating point instructions in parallel as the Athlon, or that the Pentium 4 is being throttled by its cache or decoder. An MPEG encode deals with a lot of data (in this case, 150 megabytes of it) and my guess is the small L1 cache on the Pentium 4 hurts it here. Intel's optimization document addresses this issue, referring to techniques known as "cache blocking" and "strip mining" to minimize cache thrashing by working on small portions of data at a time. Again, something that requires a code rewrite to implement.
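For the curious, cache blocking boils down to restructuring loops so that every pass over the data works on a chunk small enough to stay in the L1 cache. A rough sketch in C (the block size and the two filter passes are hypothetical placeholders, not FlasK code):

#include <stddef.h>

#define BLOCK_SIZE 8192   /* sized to fit comfortably in a small L1 data cache */

void filter_pass1(unsigned char *data, size_t len);   /* hypothetical passes */
void filter_pass2(unsigned char *data, size_t len);

/* Instead of running each pass over the whole multi-megabyte frame
   (every pass misses the cache), run all passes over one small block
   at a time while that block is still cache-hot. */
void process_blocked(unsigned char *frame, size_t frame_len)
{
    size_t pos;
    for (pos = 0; pos < frame_len; pos += BLOCK_SIZE) {
        size_t chunk = (frame_len - pos < BLOCK_SIZE)
                     ? (frame_len - pos) : BLOCK_SIZE;
        filter_pass1(frame + pos, chunk);
        filter_pass2(frame + pos, chunk);   /* data still resident in L1 */
    }
}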

For another floating point test, I ran the widely used Prime95 utility for finding Mersenne primes (see http://www.mersenne.org). Setting up the same machines, I launched PRIME95 on each machine at the same time and had each begin testing the primality of a number of roughly the same length, several million digits long. Each number requires about 24 hours of processing time. After about an hour of running time, it was clear that the Pentium 4 was neck and neck with the 900 MHz Athlon. After several hours, still tied. After 24 hours of running time both the Pentium 4 and the 900 MHz Athlon completed, while the others were still part way through. I recorded the relative progress of each machine, with the Pentium 4 and 900 MHz Athlon shown as complete and roughly tied:

Chip speed and type      Relative speed
1.5 GHz Pentium 4        >100% (tied)
900 MHz AMD Athlon       >100% (tied)
670 MHz Pentium III       90%
650 MHz Pentium III       90%
533 MHz Celeron           60%
600 MHz AMD Athlon        99%
500 MHz Pentium III       45%
600 MHz Crusoe            60%

Here, the clear floating point dominance of the AMD Athlon over the Pentium III and Pentium 4 is evident. Since the source code to PRIME95 can be freely downloaded, I looked at it. It contains a lot of hand coded assembly code, and more importantly, a LOT OF FLOATING POINT instructions. The Athlon, with its ability to execute 3 floating point instructions per clock cycle, just about keeps up with the Pentium 4 even at 60% of the clock speed. And at 600 MHz, the Athlon blows away Pentium III chips clocked over 10% faster.

In a third floating point test, running the SoftMac 2000 emulator and then running a heavily floating point based benchmark under the Mac OS, the Pentium 4 fails to keep up with chips running at less than half its clock speed, losing badly (82 seconds vs. 49 seconds) to the 670 MHz Pentium III and worse (82 seconds vs. 36 seconds) to the 900 MHz AMD Athlon.

Running other tests using various emulators, I found that in general the Pentium 4 runs emulators such as SoftMac 2000 SLOWER in most cases than the 650 MHz Pentium III and 600 MHz AMD Athlon.

A small tangent about the Transmeta Crusoe

I should stop and mention a few things about the biggest surprise to me at COMDEX Las Vegas, which was not the Pentium 4 chip but the Transmeta Crusoe chip.

See, the folks over at Transmeta have their own ideas about how to build processors, especially for portable devices that have to restrict their power use. After about 5 years of secret development, these guys came up with a chip that works slightly differently from the Intel and AMD chips.

Rather than waste millions of transistors on out-of-order schedulers and other fancy tricks, they decided to strip all this out of the chip, eliminating about 90% of the chip's power consumption. Instead, a piece of software performs the code optimizations at run time. This is essentially the concept behind a dynamically recompiling emulator, and behind a JIT (just in time) compiler such as the one found in Java.

What Transmeta has done is take this a step further and JIT the entire Windows operating system at run time, rather than, say, a tiny 100K Java applet. And they pulled it off. Using a 600 MHz chip that performs software based optimizations, I find the Crusoe-based Picturebook consistently performing at about the speed of a 300 MHz Pentium class processor, as the MPEG4 results above demonstrate.

In addition to that, the Crusoe chip has the peculiar side effect of running faster as time goes by. That is, as more time elapses, the chip's JIT compiler appears to optimize the code further, and I've actually noticed this running the SoftMac emulator: as I repeat benchmarks under the Mac OS, they get slightly faster with each run.

This shows up in the PRIME95 benchmark quite clearly, where after 24 hours of run time the Crusoe keeps right up with a 533 MHz Celeron chip - almost keeping up clock cycle for clock cycle with Intel's chip!

As I said a few weeks ago, hats off to the geniuses at Transmeta for pulling off such an amazing feat of emulation. This idea of software-assisted execution may in fact be the solution to Intel's woes in future generations of chips, as it takes the burden of code optimization off the hands of millions of software developers and puts it back in the chip without requiring millions of extra transistors.


Analyzing and understanding the results

But back to the Pentium 4 and figuring out why it sucks. I finally pulled out my big gun: a custom CPU analyzer utility which I use to analyze various processors. It measures things like the sizes and speeds of the caches, and it executes hundreds of different sample code sequences in order to measure the throughput of each piece of code on each processor. These code sequences consist of code that is commonly emitted by Microsoft's Visual C++ compiler and code that is commonly found in emulation code. I've used this utility for years to hand tune my emulators for various processors and it's served me well.
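For the curious, here is a minimal sketch of the kind of timing loop such a utility can be built around (an illustration, not my actual analyzer; it assumes ESI already points into a circular chain of pointers spread across the working set being measured):

   ; Estimate the average load latency for a given working set size
   ; by timing a long chain of dependent loads with RDTSC.
       RDTSC                  ; read the cycle counter into EDX:EAX
       MOV  EBX,EAX           ; save the starting count (low 32 bits)
       MOV  ECX,1000000       ; number of dependent loads to time
   next:
       MOV  ESI,[ESI]         ; each load depends on the previous one,
       DEC  ECX               ; so the loads cannot overlap one another
       JNZ  next
       RDTSC
       SUB  EAX,EBX           ; elapsed cycles; divide by 1000000 to
                              ; get the average cost of one load

Run it with the chain confined to 4K, then 8K, 16K, and so on, and the average load cost jumps each time the working set spills out of a cache level.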

After just a few minutes on the Pentium 4 it gave me the results I needed. I then read over Intel's Pentium 4 documents again and corroborated my results, in order to finally determine the fatal design flaws of the Pentium 4.

MISTAKE #1 - Small L1 data cache - I couldn't believe it myself when I first saw the results, but Intel's own statements confirm it. The Pentium 4 has a grossly under-sized 8K L1 data cache. That was the size of the L1 cache back in the 486, more than TEN YEARS AGO. Some idiots never learn. The L1 cache is the most important block of memory in the whole computer. It is the first level of memory that the processor goes to and it is the memory that the processor spends most of its time accessing. Intel learned back in the 486 days that 8K of cache was grossly inadequate, raising the size of the cache from 8K to 16K in later versions of the 486 and to 32K (16K code, 16K data) in the P6 family. AMD went a step further with the 128K L1 cache (64K code, 64K data) in the Athlon and Duron processors.

Going back to 8K is just plain idiotic. Sure, you save a few transistors on the chip. You also cripple the speed of just about every Windows program out there! The problem is, 8K is not a lot of data. At a 1024x768 screen resolution and a 32-bit color depth, one scan line is 1024 pixels x 4 bytes = 4K, so a mere 2 scan lines of video consume all 8K. Simply manipulating more than two scan lines of video data at a time will overflow the L1 cache on the Pentium 4.

My testing shows that while the Pentium 4 has extremely fast memory access for working sets of data up to 8K in size, at 16K and 32K sizes it is no faster than a 650 MHz Pentium III. The Pentium III's L1 cache, even though running at a much slower clock speed, keeps up with the Pentium 4's L2 cache. The 900 MHz Athlon's 64K L1 data cache in fact outperforms the Pentium 4's L2 cache. Therefore, when manipulating sound or video data, the AMD Athlon can work on 8 times as much data as the Pentium 4 without ever leaving its fastest cache.

MISTAKE #2 - No L3 cache - Intel originally specified a 1 megabyte L3 cache for the Pentium 4. This third level cache, much like a G4's backside cache or the large L2 cache in the Pentium III and Athlon, provides an extra level of fast memory to keep the chip from having to access slow main memory. The L3 cache is completely gone from the released Pentium 4 chip. Like I said, some idiots at Intel never learn. Intel learned the hard way, when it released the original crippled Celeron processor, NOT to cut or eliminate the cache. It's quite obvious that Intel realized early on that an 8K L1 cache would hurt, and compensated by adding the L3 cache. When push came to shove and Intel needed to cut corners, they cut the L3 cache.

How significant a cut is this? Well, consider that Intel DOES make versions of the Pentium III with 1 and 2 megabytes of L2 cache - the Pentium III Xeon. While more expensive than the regular Pentium III chip, ask anyone with a Xeon if they'd trade it in for a regular Pentium III. My testing shows that at working sets between 256K and 2M, a 700 MHz Xeon processor easily outperforms the Pentium 4 at memory operations. How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop. And axing the L3 cache is a main reason why the Pentium 4 is not the G4 killer it could have been.

MISTAKE #3 - Decoder is crippled - In another step back to the 486 days of 10 years ago, Intel took a rather idiotic approach to the U-V pairing and 4-1-1 grouping limitations of past decoders. They simply eliminated the extra decoders and went back to a single decoder. Only one machine language instruction can be decoded per clock cycle. The twisted logic behind this is that the trace cache eliminates the need to decode instructions every clock cycle. True, IF and only if the code being executed has already been decoded and cached in the trace cache.

But guess what my friends? When a new piece of code is called that is not in the trace cache (or, on older chips, in a traditional code cache), the processor must reach into the L2 cache or into main memory to pull in a 64 byte line of memory. Then it has to decode those 64 bytes of code. A typical x86 instruction is about 3 bytes in size, so 64 bytes of memory hold roughly 21 machine language instructions. Assuming all 64 bytes of code execute, how long will it take a Pentium 4 to decode all of the instructions? 21 clock cycles. How long will it thus take that piece of code to execute? More than 21 clock cycles. Now compare this to the Pentium III or Athlon. How long do those chips need to decode the same bytes? Roughly 7 to 11 cycles.

MISTAKE #4 - Trace cache throughput too low - Remember my analogy about the weak link in the chain. We've already found that the decoder can only feed 1 instruction's worth of micro-ops to the trace cache per clock cycle. Then, reading Intel's specs some more, we can see that the trace cache itself can only feed at most 3 micro-ops to the execution units per clock cycle.

The trace cache feeds these micro-ops to the processor core, which then executes them in one or more dedicated execution units. Intel's Pentium 4 overview mentions that the Pentium 4 processor core contains 7 execution units.

Together, these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and a write to memory, a floating point operation, and an MMX operation.

Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time! So while on the Pentium III the decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle, the Pentium 4 is crippled to the point of decoding one instruction per cycle and feeding at most 3 micro-ops to the core per clock cycle.

For well optimized code, code which follows the 4-1-1 rule and runs optimally on both Pentium III and AMD Athlon processors, the Pentium 4 is virtually guaranteed to run slower at the same clock speed. I verified this with some common code sequences. No wonder the 900 MHz Athlon keeps beating the Pentium 4 in the benchmarks.
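To make the 4-1-1 rule concrete, here is an illustrative grouping (my own example, not one from Intel's manual):

   ; Decoder 0 on the P6 family handles instructions of up to 4
   ; micro-ops; decoders 1 and 2 handle single micro-op instructions.
   ADD EAX,[EBX+8]    ; load + add, 2 micro-ops - fits in decoder 0
   INC ECX            ; 1 micro-op - decoder 1
   MOV EDX,EDI        ; 1 micro-op - decoder 2
   ; All three decode in a single clock cycle on a Pentium III; the
   ; Pentium 4's lone decoder needs three cycles for them whenever
   ; the code is not already sitting in the trace cache.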

MISTAKE #5 - Wrong distribution of execution units - This is a direct result of mistake #4: the breakdown of the execution units themselves is completely wrong.

Think about it. 5 of the 7 execution units are dedicated to handling the integer registers, the 8 "classic" registers EAX EBX ECX EDX ESI EDI EBP and ESP. Yet as is already clear, the Pentium 4 does a horrific job of executing legacy code.

Intel's own documents put heavy emphasis on the use of the new MMX registers, both the 64-bit and the 128-bit registers introduced in the P6 family. Yet only one single execution unit handles MMX. And if you read Intel's specs in more detail, they state that the unit can only accept a micro-op every second clock cycle. In other words, the 1.5 GHz Pentium 4 can at most execute 750 million floating point or MMX operations per second. But MMX is one of the very things Intel hypes up!

So why cripple the very feature you're trying to hype?

In a related act of stupidity, Intel put 3 integer ALUs in the core, two of which operate at double the chip speed. Between them, the three ALUs can accept up to 5 micro-ops per clock cycle. But we've already learned that the trace cache can provide at most 3, so one or more integer ALUs sit idle every clock cycle. It is impossible to even keep the two double-speed units fed with 4 micro-ops. So why did Intel waste transistors to implement a redundant ALU, yet cut corners by eliminating a much more needed second floating point unit? It's just plain idiotic.

MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back to the days of the 486, even the days of the 286, by eliminating the high-speed barrel shifter found in every previous 386, 486, Pentium, 68020, 68030, 68040, and PowerPC chip. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.

How bad is this mistake? For emulation code, it's absolutely devastating. Shift operations are used for table lookups, for bit extractions, for byte swapping, and for any number of other operations. For some reason, Intel's engineers just could not spare a few extra transistors to keep shifts fast, yet they waste transistors on idle double-speed ALUs. To show how routine shifts are, consider the kind of dispatch sequence an emulator executes millions of times per second, sketched below.
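Here is what such a dispatch sequence typically looks like (a sketch in the spirit of an emulator's inner loop - not SoftMac's actual code, and the dispatch table is hypothetical):

   MOVZX EAX,DX            ; EDX holds a 16-bit 68K opcode word
   SHR   EAX,6             ; shift the field we want into position -
                           ; 4 to 6 cycles on a Pentium 4, 1 elsewhere
   AND   EAX,3Fh           ; mask off the 6-bit field
   JMP   [dispatch+EAX*4]  ; scaled jump through a hypothetical table
                           ; of handler routines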

Intel's own documentation is now contradictory. On the one hand, Intel has for years advocated the use of shift and add operations to avoid costly multiply operations. For example, to multiply by 10, it is quicker on the 486 and Pentium to use shifts to multiply by 2 and by 8 and then add the two results. However, on the Pentium 4 this shift and add trick can take as long as 6 or 7 clock cycles, which negates much of the benefit over using a multiply.

This appears to have something to do with the fact that the original Pentium 4 design called for two address generation units (AGUs), which are circuits that quickly calculate addresses for memory operations. In previous chips the AGU contained a barrel shifter to quickly handle indexed table lookups, which the Pentium 4 now handles using the much slower shift unit. The shift and add trick was usually accomplished by the AGU via a programming trick using the LEA (load effective address) instruction. This trick is now rendered useless thanks to Intel cutting out the part.

MISTAKE #7 - Fixed the partial register stall with a worse solution - While it is true that the partial register stall is finally a thing of the past on the Pentium 4, Intel's solution is less than elegant. It is not only worse than AMD's solution, but actually worse than the problem it tries to fix. Accessing certain partial registers now involves the shift/rotate unit, meaning that a simple 8-bit register read or write can take longer than accessing L1 cache memory! It's backwards!

MISTAKE #8 - Instructions take more clock cycles to complete - This is not so much a specific mistake as it is an overall side effect of the first 7 idiotic mistakes. The end result of all the cost cutting and silicon chopping is that typical code sequences now take more clock cycles to execute than on the P6 architecture. Intel relies on the much faster clock speed of the Pentium 4 to overcome this problem, but this only works against the Pentium III and slower Intel processors. Against the AMD Athlon, it loses badly.

As I mentioned above, typical code sequences generated by C++ compilers now take more clock cycles to execute. This is due partly to the brain dead decisions to decode only one instruction per clock cycle and to feed only 3 micro-ops to the core per clock cycle, and partly to the longer pipeline used in the Pentium 4: flow control operations (such as branches, jumps, and calls) take longer because it takes longer to refill the processor pipeline.

For example, an indirect call through a general purpose register, common when making member function calls in C++, now takes about 40 clock cycles on the Pentium 4. Compare this to only 10 to 14 cycles on P6 family and AMD Athlon processors. Even at the faster clock speed, Pentium 4 function calls are slower overall. Similarly, Windows API calls, which call indirectly through an import table, are now slower. Several Windows APIs that I tested literally took 2 to 3 times the number of clock cycles to execute on the Pentium 4. This is because not only do all the internal function calls within Windows take longer, but you have to remember that Windows 2000 and Windows Millennium are compiled using C++ compilers that optimize for Pentium III and Athlon processors. So as I mentioned at the beginning, until most Windows code is recompiled using as-yet-non-existent Pentium 4 optimized C++ compilers, the performance of Windows applications will be terrible on the Pentium 4 processor.
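To picture what that means, here is roughly what a C++ virtual function call compiles down to (a hypothetical sequence using Microsoft's convention of passing the object pointer in ECX):

   MOV  EAX,[ECX]           ; fetch the object's vtable pointer
   CALL DWORD PTR [EAX+8]   ; indirect call to the third virtual
                            ; function - roughly 40 cycles on the
                            ; Pentium 4, 10 to 14 on a P6 or Athlon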

The Verdict

If it isn't clear already, the Pentium 4 is a terrible choice for PC users. It is a severely crippled processor that does not live up to its original design specifications. It makes inefficient use of available transistors and chip space. It places a higher burden on software developers to optimize code, contrary to the trends being set by AMD and Transmeta. It reverts to 10 year old techniques which Intel abandoned and apparently forgot why. And it just plain runs slower than existing Pentium III, Celeron, and AMD Athlon chips.

Intel needs to heavily beef up the L1 cache size, add the missing L3 cache, add more decoders, raise the transfer rate from the trace cache to the core, lower the cost of shift operations, and add additional FPU and MMX execution units. Once these changes are made, and only then, will the Pentium 4 begin to even be a threat to its own Pentium III and Pentium III Xeon processors and be a viable contender for the speed crown currently held by the AMD Athlon.

Intel, Dell, Gateway, and other computer manufacturers have intentionally misled consumers about the performance of the Pentium 4 processor and are currently involved in selling useless overpriced hardware to unsuspecting consumers. Do what I did with my two Pentium 4 machines. Send them right back to Gateway and Dell and let them know they're selling crap. A $1500 Athlon system is a far better choice of computer than a $4000 Pentium 4 system. No doubt about it and I challenge anyone from Intel, Dell, or Gateway to prove my statements wrong.

Comments? Flame mail? Got a Pentium 4 horror story to share? Email darekm@emulators.com

Also, I'd like to point people at an excellent web site that I've read for years - http://x86.org - through which I tracked down some of the info for this review. Robert Collins' site is updated almost daily with fresh news about Intel and also includes some excellent technical information about Intel's processors.

Some other hardware related sites and pages you might want to check out (yes, I'm repeating a few sites I already mentioned):

Gateway's Pentium 4 page

Tom's Hardware

http://www.eet.com/story/OEG20001213S0045

http://www.theregister.co.uk

http://www.athlonoc.com

http://www.apushardware.com

Intel Pentium 4 Processor Optimization manual

Pentium III manuals

AMD Athlon manual

http://www.mersenne.org


Updates

I'll try not to edit the original text any more, since I'm sure no one wants to re-read 25 pages to see what I changed. Instead, I'll list updates here and respond to some email questions and comments.

Forgotten chips

Several people reminded me of some x86 clone processors I didn't mention on this page. Back in the 80's NEC put out a couple of x86 clones: the V20, a clone of the 8088, and the V30, a clone of the 8086. Both chips offered slightly faster performance than the Intel parts. To be honest, I have no clue what ever became of those processors. I've seen references to the V40 and V50, but perhaps someone can fill me in on the details of those chips.

In addition to AMD, there were also clone chips from Cyrix and IDT (maker of the WinChip). I particularly liked the early Cyrix chips, as they kept up with Intel's 386 and 486 chips. Since running my last exhaustive set of emulation benchmarks 5 years ago (back in the 386 and 486 days, wow!), I've highly recommended the Cyrix 486 upgrade for 386 processors. The upgrade not only more than doubles the speed of a 386 (using a simple chip replacement which requires no additional wires or heat sinks) but it also makes the computer compatible with Windows 95 OSR2 and Windows NT 4.0, bringing even a 14 year old 386 machine up to date with Windows 95, DirectX, Internet Explorer, and our emulation products. True, you wouldn't dare run Windows Millennium or Windows Media Player on such a machine, but it's a great way to turn an old computer that would normally go in the junk pile into a neat little email terminal.

Back in the "good old days" processor packages and pinouts didn't change on a monthly basis like they do today. You could actually plug in processors from different manufacturers into the same computer. For example, I have a 90 MHz Gateway computer (or Gateway 2000 as they called themselves before) which I first upgraded with a 180 MHz Pentium MMX Overdrive chip, and then with an Evergreen WinChip chip. Most recently a few months ago I upgraded the machine yet again to an AMD K6-2 running at about 350 MHz. Same computer, over a 6 year period, running 4 different processors from 4 different manufacturers.

The same is somewhat true today, with Slot 1 motherboards supporting the Pentium II, the Celeron, and the first generation Pentium III. I've upgraded a number of machines built around the ASUS 440BX motherboard from a 233 MHz Pentium II to a 366 MHz Celeron to a 400 MHz Pentium II to a 500 MHz Pentium III. Ditto for Athlon and Duron processors. Unfortunately, you can no longer mix and match the latest AMD and Intel chips on the same motherboard.

MMX did run faster than 233 MHz

A number of people disagreed with my statement that MMX processors ran at 266 and 300 MHz, emailing me to state that the Pentium MMX processor never went past 233 MHz. Some even pointed at Intel's own data sheets that say so. Unfortunately, these people are wrong, as anyone with a memory going back to 1997 or 1998 can easily recall Compaq and a number of other manufacturers putting out 266 and 300 MHz MMX boxes.

In fact a very quick scan of Intel's web site easily proves the existence of the 266 MHz MMX chip:

http://www.intel.com/pressroom/archive/releases/mP011298.HTM

and another year later Intel came out with the 300 MHz part:

http://www.intel.com/pressroom/archive/releases/mp010799.htm

A little bit more searching of the Internet news sites confirms the release of a 266 MHz desktop MMX chip; I'll leave that exercise to the reader. Anyone who's already forgotten the years 1998 and 1999... hmmm, I have to wonder about you.

RISC vs. CISC, get over it

This seems to have touched a nerve with a number of people, since I called the whole CISC vs. RISC argument bullshit. It is. RISC is just a simpler way of designing a processor, but you pay a price in other ways. Placing the constraint that the instruction set be fixed width - that all instructions be 2 bytes, or 4 bytes, or 16 bytes in size (as with the Itanium) - allows the engineers to design a simpler decoder and to decode multiple instructions per clock cycle. But it also means that the typical RISC instruction wastes bits, because even the simplest operation now requires, in the case of the PowerPC, 4 bytes. This in turn causes code on a RISC processor to be larger than code on a CISC processor. Larger code means that for the same size code cache, the RISC processor will achieve a lower hit rate and pay the penalty of more memory accesses. Alternately, the RISC processor requires a larger code cache, which means more transistors, and this merely shifts transistors from one area of the chip to another.

The people who declared x86 and CISC processors dead 10 years ago were dead wrong. CISC processors merely adopted the same kinds of tricks as RISC processors - larger caches, multiple decoders, out-of-order execution, more registers (via register renaming), etc. In some cases, such as during a task switch when the entire register set of the processor needs to be written to memory and a different register set read in, the larger number of "visible" registers on a RISC chip causes more memory traffic. This in turn puts a load on the data cache, so as with the code cache, you either make it larger and spend more transistors, or you pay a slight penalty.

My point is, those idiots from 10 years ago were wrong that RISC is somehow clearly superior to CISC and that CISC would die off. It's merely shifting transistors from one part of the chip to another. On the PowerPC, all instructions are 32 bits (4 bytes) long. Even a simple register move, an addition of 2 registers, a function return, or pushing a value to the stack requires 4 bytes. Saving the 32 integer registers alone requires 128 bytes of code - 4 bytes per instruction times 32 instructions. Another 128 bytes to reload them. Ditto for the floating point registers. So who cares that it simplifies the decoder and removes a few transistors there. It causes more memory traffic and requires more transistors in the cache.
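To put some numbers on that, compare a few encodings (my own examples; the x86 byte values are easy to verify with any debugger):

   x86 instruction         bytes     rough PowerPC equivalent   bytes
   RET             (C3)      1       blr                          4
   PUSH EAX        (50)      1       stwu r3,-4(r1)               4
   MOV EAX,EBX  (8B C3)      2       mr r3,r4                     4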

And the decoding problem is not that big of a problem. I'll use the example of the 68040, the PowerPC, and the x86. A PowerPC chip can decode multiple instructions at once since it knows that each instruction is 4 bytes long. A 68040 processor has instructions that are a minimum of 2 bytes long and can go up to 16 bytes in size (I think - I can't come up with an example longer than 16 off the top of my head, so let's say 16). The bits required to uniquely decode an instruction are usually found in its first 2 bytes, or first 4 bytes for floating point. That's all the decoder needs to figure out what the instruction is. It needs to decode the additional bytes only in cases of complex addressing modes. This is one area where Motorola screwed up (and it likely decided the fate of the 68K): they truly made a complex instruction set that requires decoding of almost every byte.

In the case of x86, Intel either lucked out or thought ahead, and made sure that all the bits necessary to decode an instruction are as close to the beginning of the instruction as possible. In fact, you can usually decode an x86 instruction from at most its first 3 bytes. The remaining bytes are constant numbers and addresses. You don't need to decode, say, the full 15 bytes of an instruction when the last 10 bytes are data that gets passed on down into the core. As one reader pointed out in email, Intel stuck with the old 8-bit processor technique (as in the 6502) of placing all the instruction bytes first, then the data bytes. In the case of the 6502, only the first byte needs decoding; any additional bytes in the instruction are 8-bit or 16-bit numeric constants.
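As a concrete example, the array-read instruction I dissect later in this article breaks down like this (byte values worked out by hand, so treat them as illustrative):

   ; MOV EBX,[EBP+EAX*4] encodes as 4 bytes: 8B 5C 85 00
   ;   8B = opcode: MOV of a 32-bit register from memory
   ;   5C = ModRM byte: destination EBX, SIB byte follows
   ;   85 = SIB byte: base EBP, index EAX, scale by 4
   ;   00 = 8-bit displacement of zero (required when EBP is the base)
   ; Everything the decoder needs is in the first 3 bytes; the
   ; displacement is just data passed down into the core.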

So decoding x86 is quite trivial, almost as easy as decoding RISC instructions. AMD seems to have figured out how to do it. Intel almost had it figured out in the P6 family (with only the 4-1-1 rule to hold them back), and then for the Pentium 4 they decided to cut features and just gave up on the whole decoding thing. That's Mistake #3 on my list of course, but it in no way demonstrates that fixed size instructions are superior to variable sized instructions.

Over the years, CISC and RISC have kept pace with each other. Sure, one technology may leap ahead a bit, then the other catches up a few months later. Neither technology has taken a huge lead over the other, since the decision whether to use fixed or variable sized instructions, and whether to have 8, 16, 32, or 64 registers in the chip, are just two factors in the overall design of a processor. Much of the rest of the design between RISC and CISC chips is very similar. And over time, ideas get borrowed both ways.

A specific code example showing the difficulties of optimizing for Pentium 4

Someone claiming to be a "senior programmer" at a well known video game company took issue with my mistakes #6 and #7, which both boil down to the same problem of slow shift operations. Intel removed a couple of units from the Pentium 4 called address generation units (or AGUs). These nifty little circuits, introduced about 15 years ago in the 386 and 68020, have been present in every 68020, 68030, 68040, PowerPC, 386, 486, Pentium, Celeron, K5, K6, and Athlon since, and in other 32-bit x86 clones.

The AGUs do a couple of things. First, they can add up to three numbers together in a single operation. This is useful for reading a memory element in an array, for calculating the address of something in an array, or for a pure mathematical calculation involving 3 values. Starting with the 486 processor this became a one clock cycle operation, allowing the 486 to perform simple mathematical operations as quickly as a RISC processor. RISC processors have long had the ability to take 2 or 3 source values, add them, and store the result in another register. This is a case of a good idea being stolen from RISC designs and put into a CISC processor.

Thanks to this extra circuit, a shift operation is as basic an operation as an ADD or a SUBtract. And shifts are frequently used. For example, in the C language, bitfields are read and written using shift operations. If it's a signed bitfield, two shift operations are needed - a shift to the left to blow away the unused high bits, then an arithmetic shift to the right to propagate the sign bit (I'll sketch this below). Shifts are also used when calculating addresses. For example, if you have an array of integers and you want to access the integer indexed by the EAX register, you would use an instruction such as

   MOV EBX,[array+EAX*4]

to read the integer. "array" is a 32-bit numeric constant that specifies the address of the array, while the EAX*4 means to scale the value in EAX by 4 (using a quick shift to the left) and then to add that to the address of the array. If it is an array of floating point numbers, use a scaling value of 8 instead of 4, since a floating point number (a "double" in C language) occupies 8 bytes of memory.
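And before moving on, here is the signed bitfield case I mentioned above - the two-shift sequence, assuming a hypothetical 8-bit field sitting in bits 8 through 15 of EAX:

   SHL EAX,16    ; shift left to blow away the unused high bits,
                 ; moving the field into the top byte
   SAR EAX,24    ; arithmetic shift right propagates the sign bit,
                 ; leaving the sign extended field in EAX

Two back-to-back dependent shifts: 2 cycles on a Pentium III or Athlon, but on the order of 8 or more on a Pentium 4.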

If the address of the array is not constant, for example, it is allocated at run time, or is a multi-dimensional array, then you can use a second register in place of the constant. For example, if the address of the array is stored in the EBP register, then the instruction becomes:

   MOV EBX,[EBP+EAX*4]

This is one way local variables in a C function are accessed. If you simply want to calculate the address of the memory instead of reading it, you use the LEA (Load Effective Address) instruction, which stores the address itself into EBX instead of the value at that address.

In general, the MOV and LEA instructions can use addressing modes of the form [base + index*scale + displacement] where "base" is any of the 8 32-bit integer registers, "index" is any other 32-bit register (and can include the base register), "scale" is a scaling factor of either 2, 4, or 8 (or none for a default scaling of 1), and displacement is a 32-bit integer which contains an address or an offset. Not all of the 4 addressing components need to be used.

With the LEA instruction, the x86 processor can now perform a 3-number add, with something like a C expression "a = b + c + 10;" translating into EAX = EBX+ECX+10 and being coded into one instruction:

   LEA EAX,[EBX+ECX+10]

Notice that no memory is actually referenced. LEA is used merely to calculate values by performing the addition of a base register (EBX) with an index register (ECX) with some constant displacement (10). This is what the address generation unit (AGU) does, allowing the processor to quickly calculate addresses of array elements, screen pixel locations, and do some basic arithmetic in one clock cycle. Without this trick you would have to break it up into multiple instructions:

    MOV EAX,10
    ADD EAX,EBX
    ADD EAX,ECX

This not only requires more code bytes but runs slower, since the three instructions may not all decode in one cycle, and the operations happen serially, not in parallel. LEA makes it a breeze to evaluate simple expressions quickly and allows x86 to keep up with RISC processors at such basic operations.

On all of the processors I mentioned above (68K, PowerPC, x86) the addressing modes can scale by a factor of 2, 4, or 8 with no overhead, thanks to the fast shifter. This scaling trick can be used to quickly multiply by small constants - say, the count of bytes in a scan line when drawing to video memory, or the number of bytes in a column of a two dimensional array. Multiplying by a constant is not a rare occurrence by any means.

If you look at mistakes #6 and #7 listed above, I'll show how something as simple as multiplying a register by 10 becomes slower on the Pentium 4. In my description of that problem I used the example of multiplying by 10, stating that you can multiply by 2, multiply by 8, then add the results. To keep the explanation simple, I did not go into the actual machine language details of how that would be done. But based on his and other people's email feedback I shall demonstrate that now, and you will see just how much more difficult it is to optimize code for the Pentium 4 because of Intel's decision to cut a basic feature that has been around for 15 years.

The slowest way on most if not all x86 processors to multiply an integer is to use a variant of the MUL instruction, such as a signed integer multiply:

   IMUL EAX,10

This can take 5, 10, 20, or even more clock cycles depending on the exact processor. Ditto for the Motorola chips. Generic 32-bit multiplies take a long time because implementing a multiplier requires thousands of transistors, with delays between the various stages of those transistors. The multiplier circuit also produces a 64-bit product, and most compilers and programmers merely end up throwing away the upper 32 bits of the result. A waste.

Let's see what the Microsoft Visual C++ 6.0 compiler does. Feel free to try this yourself. With no optimizations the compiler produces exactly that code for an expression such as x*10:

    IMUL EAX,10

With full optimizations it generates the following code (due to the simplicity of this expression, the compiler generates the same code whether you use -O1, -O2, -G4, -G5, or -G6):

    LEA EAX,[EAX+EAX*4]
    SHL EAX,1

This multiplies by 10 by first multiplying by 5, then multiplying by 2. Three operations are involved: a shift, an add, and a shift. The first shift and add execute in one cycle thanks to the AGU, so on most x86 processors (excluding the Pentium 4) this takes 2 clock cycles. Even less if out-of-order execution pairs these instructions up with other instructions. But in general, 2 cycles.

Why can this not execute in a single clock cycle? Because of the data dependency between the two instructions. SHL (shift left) cannot do its job until the results of the LEA are known. On the Pentium 4, the slow shift unit makes this code take 6 clock cycles to execute, three times slower than expected. A fine example of how today's compilers (and the Windows code YOU are running on your PC right now) are not ready for Pentium 4.

I have a theory as to why compilers use this code sequence, but I'll get to that in a bit. Can we do better? The code sequence the "senior programmer" suggested was this:

    LEA ECX,[EAX+EAX]
    LEA EAX,[ECX+ECX*4]

This is valid (ECX gets EAX*2, then EAX gets ECX+ECX*4 = (EAX*2) + (EAX*2)*4 = EAX*10) and takes advantage of the fact that a multiply by two can be encoded either as an addition of a base register to an index register, or as a multiplication of the index register. Since non-scaled addressing existed in the 16-bit processors, using the addition form produces shorter code than the *2 form. So this code is pretty good, and on a Pentium III, sure enough, it still takes exactly the same 2 cycles to execute as Microsoft's code.

On the Pentium 4 this now drops to 3 cycles, since he has eliminated an entire shift operation. But he forgot that on older processors, such as the Pentium MMX and earlier, there is a one cycle delay in the AGU, and so this code takes 4 cycles on those older processors compared to 3 cycles for Microsoft's code.

Oh oh.

He also makes the novice mistake of thinking that "spreading the load" across a second register will somehow speed up the execution. As if using ECX as a temporary register will eliminate the data dependency between the instructions. Of course it won't! In fact, the following code, which only uses EAX, runs at exactly the same speed (including on the Pentium 4 and Pentium MMX) as his code:

    LEA EAX,[EAX+EAX]
    LEA EAX,[EAX+EAX*4]

When writing code, you want to minimize the number of "visible" registers used. On older processors, the delay in the AGU kills any advantage a temporary register may get you in the pipeline. On the P6 family, Athlon, and Pentium 4, register renaming automatically assigns a second internal register as necessary.

So, to the beginner machine language programmer: limit your use of registers when evaluating an expression unless there is a real gain to be had. In this case, yes, there is an optimization that specifically helps the Pentium 4 when a second register is used. And this is where my "multiply by 2, multiply by 8, and add the results" technique comes in.

Notice that in both the Microsoft code and in the "senior programmer" code, reversing the two instructions does not eliminate the data dependency. But what if you did this, making use of a temporary register:

    LEA ECX,[EAX+EAX]
    LEA EAX,[ECX+EAX*8]

To the average programmer, this looks like the same thing. After all, x*2 + x*8 = x*10, the same result as before. What's different???

The difference here is what a crippled Pentium 4 now has to do with no AGU present. How did Intel's engineers work around their "budget cut"?

What they did was break address calculations down into their basic add and shift operations - micro-ops - and feed those into the various execution units. [EAX+EAX] becomes a simple add micro-op which can be executed by any of the ALUs. The EAX*8 becomes a shift micro-op executed by the slow shift unit. And the final addition is another quick add. Add, shift, add. Isn't that the SAME thing the previous example does?
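Shown schematically (my reading of Intel's description, not an actual trace):

   ; How the Pentium 4 cracks the two LEA instructions into micro-ops:
   ;   LEA ECX,[EAX+EAX]    ->  add:   ECX = EAX + EAX   (fast ALU)
   ;   LEA EAX,[ECX+EAX*8]  ->  shift: tmp = EAX << 3    (slow shifter)
   ;                            add:   EAX = ECX + tmp   (fast ALU)
   ; The shift reads the original EAX, not the result of the first
   ; add, so the two can start in the same cycle; only the final
   ; add has to wait for both.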

On every other x86 processor that I've tested, these two sequences of code are essentially the same thing and execute in exactly the same number of clock cycles. On the Pentium III, on the Pentium MMX, on the Athlon: same speed. But on the Pentium 4, my code will execute slightly faster in most scenarios because the shift unit is freed up a cycle sooner. By eliminating the AGU and breaking up the address calculation, the processor can take advantage of out-of-order execution and start the shift operation on the same clock cycle as the first addition. Then, once the shift is complete, the second add executes. Remember, out-of-order execution does not only mean that individual instructions execute out of order; parts of instructions (the micro-ops) do as well. This is true on the P6, on the Athlon, and on the Pentium 4.

So Intel's engineers worked around their decision to cut the two AGUs by taking advantage of out-of-order execution to make up some of the loss. However, this comes at the expense of requiring compiler writers to change their code, and it hurts speed on older processors. It also hurts C++ code that tends to use a lot of double indirection and table lookups, which results in delays caused by back-to-back shifts being fed through a single slow shifter.

And that is why I have to wholeheartedly disagree with Intel's decision to cut not just one but both units. It breaks the 15 year old pattern that programmers have relied on of having fast shift and add operations.

Look at the Motorola PowerPC chip. It has a whole arsenal of fast shift instructions that can perform bitfield extractions, bitfield insertions, left shifts, right shifts, rotates, and bit masking operations, all in a single clock cycle. That is perfectly suited to C and C++ code.

So let's go back to why Microsoft's compiler emits an LEA and a shift instead of two LEA instructions. Again, it has to do with the fact that address generation had some extra overhead on earlier processors such as the Pentium MMX and 486, and since those processors do not support out-of-order execution, it is sometimes quicker to use the addition and shift operations directly. Taking Microsoft's code and rewriting it to:

    LEA EAX,[EAX+EAX*4]
    ADD EAX,EAX

produces code that runs as quickly as the original example on all past x86 processors, with the benefit of running faster on the Pentium 4. This sequence is not optimal on the Pentium 4, but it is the second best choice. It is the code that, as a compiler writer or assembly language programmer, I would choose as the best overall sequence, giving ideal or near ideal speed on all 32-bit x86 processors. If I were optimizing specifically for the Pentium 4 and could punt the 486 and Pentium MMX a little, I would use the two LEA sequence to implement the (x*2)+(x*8) expression.

This trivial and somewhat contrived example shows just how much even the most basic code sequences used in Windows code today need to be re-examined and compilers changed. It's going to be messy. Intel could save all of us a lot of trouble (and speed up today's off-the-shelf software) by making either of two simple modifications to future Pentium 4 processors with respect to this whole mistake #6 and #7 shifting thing: speed the shift/rotate unit back up to a single clock cycle, or put the address generation units back in.

I'd be happy with either. As always, send your comments to: darekm@emulators.com