[Image: Prescott cover]

April 8th, 2004


The first mainframe I saw was in 1975, an IBM 360, obviously with no monitor: there was a little desk, about 80 cm wide (about 30"), with a keyboard, on top of which two tractors pulled out a 132-column-wide sheet of paper. Programs were written in Fortran on punched cards, and once written, the whole deck was fed in through a punch-card reader. And while the computer needed two small rooms, the storage units were in the hallway. Those were the times when technicians wore a white uniform, like a doctor or a nurse.

Five years later, in 1980, I was hired as sysadmin on a midi system, a DEC PDP-11, and I began serious programming, adventurously obtaining a Fortran compiler and a linker. It was a small network with three stations connected through serial ports. The 8-inch dual floppy-disk unit alone measured 70 cm x 90 cm (about 2 x 3 ft) and weighed about 25 kg (about 55 lbs). Yes, you read that right: a dual floppy-disk drive weighed 25 kg.

The latest CPUs in production today, codenamed Prescott, carry 125 million transistors in a mid-range model. So we have arrived at this Prescott. A real improvement? Well, as we'll examine in the next paragraphs, there are some major changes that would truly justify a new name, so why did Intel keep the name Pentium 4? That too we'll try to discover.


This time it looks like Intel really did a good job. There are several improvements over the last Pentium 4, codenamed Northwood. The first is a whole set of enhancements, unified under the name Intel NetBurst™ Microarchitecture, an update yielding higher frequency and raw performance. This group includes:
  1. larger on-die caches;
  2. deeper pipeline;
  3. an improved Hyper-Threading Technology;
  4. better clock distribution;
  5. SSE3 - 13 new instructions;
plus some other minor enhancements like:
  1. 90-nanometer process technology;
  2. seven layers of low-k copper interconnect;
  3. a better branch prediction system.
[Image: Prescott, top view]

Greater Levels of on-chip Cache

Leaving aside that many years ago DEC's Alpha chip already had 2 MB of on-board cache in its top model, Intel added 1 MB of Level 2 Advanced Transfer Cache, which transfers more data on each core clock, delivering a much higher-throughput data channel than Northwood's 512 KB. This is very important in an HT Technology-enabled processor like Prescott: with two threads running on one processor, more cache space is needed.

Then the Intel rocket scientists launched many studies on the data flows the processor actually uses most, and changed the algorithm that manages the cache. Previously, placement in the queue was managed by the so-called Least Recently Used (LRU) policy; now, in the words of Anand Chandrasekher, an Intel Vice President: "We've also enhanced the data supply into the CPU. We've looked at where does the CPU spend most of its time, where does the platform spend most of its time in a notebook environment, again keeping in mind how users use notebooks, and based on that information and all of the studies, optimized the cache so that it's not only a larger cache. It's also a very power aware cache. We've also modified the system bus and added some new instructions or modified our pre-fetch instructions to deliver better overall performance while delivering power management at the same time".
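
To picture the replacement policy being discussed, here is a minimal Python sketch of an LRU cache - a textbook model for illustration only, of course, not Intel's actual hardware:

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of a Least Recently Used replacement policy:
    every access moves a line to the 'most recent' end; when the
    cache is full, the line at the 'least recent' end is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data

    def access(self, address, data=None):
        if address in self.lines:
            self.lines.move_to_end(address)   # hit: mark as most recent
            return self.lines[address]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # miss: evict least recent line
        self.lines[address] = data
        return data

cache = LRUCache(capacity=2)
cache.access(0x100, "a")
cache.access(0x200, "b")
cache.access(0x100)          # touch 0x100, so 0x200 is now least recent
cache.access(0x300, "c")     # full cache: evicts 0x200, not 0x100
print(0x200 in cache.lines)  # False
```

The point of the policy is simply that whatever was touched longest ago is sacrificed first; Intel's change was to move beyond this one-size-fits-all heuristic.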

Further improvements according to Ron Smith, a Senior Vice President, are: "In our Intel StrataFlash memory, we have introduced something called a high speed burst interface specifically for bursting information into the cache of the XScale, to keep that cache filled, effectively giving you even higher MIPs by how you balance the load with your memory".

Plus we have a larger L1 cache: 16 KB vs. the 8 KB of the earlier Pentium 4. All this - plus other enhancements - sounds great, doesn't it? So why aren't the benchmarks a lot better than the Pentium 4's? We'll see; for now let's go on.
[Image: Prescott, bottom view]

Deeper Pipelines

Intel's architects were not happy with simply shrinking the CPU and adding more cache. Another way to scale performance is to increase the clock frequency, and one way to a higher clock rate is to increase the number of pipeline stages, since more stages mean less circuit propagation delay per stage.
So the pipeline in Prescott has been extended from 20 stages to the current 31. Big deal, huh?
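
A back-of-envelope Python sketch shows why splitting the same logic into more stages allows a higher clock. The delay figures below are invented purely for illustration, not Intel's real numbers:

```python
def max_clock_ghz(stages, total_logic_ns=10.0, latch_overhead_ns=0.05):
    """Cycle time = worst-stage logic delay + fixed latch overhead.
    Both delay figures here are made-up illustrative values."""
    stage_delay_ns = total_logic_ns / stages + latch_overhead_ns
    return 1.0 / stage_delay_ns  # period in ns -> frequency in GHz

for stages in (20, 31):
    print(f"{stages} stages -> about {max_clock_ghz(stages):.2f} GHz")
```

With these toy numbers, 31 stages clock roughly 50% faster than 20 - but note the fixed per-stage latch overhead, which is why frequency does not scale linearly with depth.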

Well, but what if the instructions in the pipeline are not required? I mean that if a pipeline flush occurs due to poor branch prediction, the penalty can be heavy: in Northwood it was a 20-cycle performance penalty. So a good idea of what is going to fill the pipeline is required. This leads us to the next paragraph.
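
The cost of those flushes can be sketched as a one-line expected value; the 5% misprediction rate below is a made-up figure for illustration, not measured data:

```python
def avg_penalty_per_branch(mispredict_rate, flush_penalty_cycles):
    """Expected extra cycles paid per branch due to pipeline flushes."""
    return mispredict_rate * flush_penalty_cycles

# Hypothetical: Northwood refills ~20 stages on a flush, Prescott ~31;
# assume 5% of branches are mispredicted (an invented rate).
rate = 0.05
print("Northwood:", avg_penalty_per_branch(rate, 20), "cycles/branch")
print("Prescott: ", avg_penalty_per_branch(rate, 31), "cycles/branch")
```

The deeper pipeline multiplies the same misprediction rate into a bigger average penalty, which is exactly why better prediction became mandatory.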

[Images: Northwood (left) and Prescott (right)]

Branch Prediction

The improvements with respect to the Northwood/Willamette architecture have been made basically in two areas:
  1. the Branch Target Buffer (BTB), where the flow of x86 instructions is loaded;
  2. the trace cache (the L1 cache containing the micro-ops).
The BTB now has 2K entries, up from the previous 512, and the game is played between static prediction and dynamic prediction. In previous P4s the branch prediction algorithms were part of the loop, while in Prescott some logic has been added to determine whether the next branch is part of the current loop or belongs to a following one. So if a branch is not in the BTB and must be predicted, a flag is added to both branches; if a preset threshold is exceeded, that branch is discarded.

On the other side, dynamic branch prediction has been strengthened by adding an indirect branch predictor. According to Intel's data, this technique brings an improvement of 2 to 20%.
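
As an illustration of dynamic prediction in general - a textbook two-bit saturating counter, not Prescott's actual circuit - this Python sketch shows why a little hysteresis keeps a well-trained entry alive across a single loop exit:

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter, one per predictor entry:
    states 0-1 predict 'not taken', 2-3 predict 'taken'. Two wrong
    guesses in a row are needed before the prediction flips."""

    def __init__(self):
        self.state = 2  # start weakly 'taken'

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
history = [True] * 9 + [False]   # a 10-iteration loop: taken 9x, then exit
correct = 0
for actual in history:
    correct += (p.predict() == actual)
    p.update(actual)
print(f"{correct}/10 correct")   # only the final loop exit is mispredicted
```

A single exit costs one misprediction, but the counter stays at 'taken', so the next run of the same loop starts out predicted correctly.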

So branch prediction is a core issue in improving performance: if the CPU knows, or at least guesses, what's coming next, it can fill its pipeline more efficiently. But with a longer pipeline the processor is more exposed and vulnerable to misprediction.

[Figure: Prescott block diagram]

Improved Hyper-Threading Technology

Intel's geniuses say they added specific features to Prescott to enhance its HT performance. Let's briefly review what HT means. You understand that if your desktop has 2 CPUs, things happen faster. That's what HT does: it simulates 2 CPUs, splitting itself in two halves and working on two things at the same time. So you don't need two CPUs, and you don't need to pledge or mortgage your mother-in-law.

At Intel they are very proud of HT Technology. So do you expect to go twice as fast as without HT? Wrong. And it's not even close. However, it helps; it's better than getting a ticket for jaywalking.
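
How does one physical core gain anything from two logical ones? Roughly: when one thread stalls (cache miss, I/O wait), the other uses the otherwise idle execution resources. A rough Python analogy, with sleep() standing in for stalls - this models the overlap, not real HT hardware:

```python
import threading
import time

def task(stall_s=0.2, rounds=3):
    """Model of a thread that spends its time stalled: during a
    stall its execution units sit idle, and a sibling hyper-thread
    could use them. sleep() is a stand-in for those stalls."""
    for _ in range(rounds):
        time.sleep(stall_s)

start = time.perf_counter()
t1 = threading.Thread(target=task)
t2 = threading.Thread(target=task)
t1.start(); t2.start()
t1.join(); t2.join()
elapsed = time.perf_counter() - start
# The two tasks overlap their stalls, so wall time is close to one
# task's duration (~0.6 s), not the sum of both (~1.2 s).
print(f"two interleaved tasks took about {elapsed:.1f} s")
```

That overlap is also why HT never doubles throughput: two compute-bound threads with no stalls would simply fight over the same execution units.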

[Image: Prescott die text]

Better Clock Distribution

Here I didn't find much news. There is only a note on the Intel site declaring: "Better clock distribution: up to four times better than the previous desktop platform based on the Intel NetBurst architecture. This in part helps Prescott scale to the 4-5 GHz range."

SSE3 - 13 New Instructions

Intel declares: "13 new processor instructions designed to improve performance for special application areas such as media and gaming. These instructions are grouped into five areas: floating point to integer conversions, complex arithmetic, video encoding, SIMD floating point using AOS format, and thread synchronization."
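
One of the 13, ADDSUBPS, alternates subtract and add across the SIMD lanes - exactly the sign pattern of complex multiplication. A Python model of the instruction's effect (a sketch of its semantics, not actual SSE3 code):

```python
def addsubps(a, b):
    """Model of SSE3's ADDSUBPS: even-numbered lanes are subtracted,
    odd-numbered lanes are added - the +/- pattern that appears in
    complex multiplication."""
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]

# Complex multiply: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
a, b, c, d = 1.0, 2.0, 3.0, 4.0
real, imag = addsubps([a * c, a * d], [b * d, b * c])
print(real, imag)  # (1*3 - 2*4, 1*4 + 2*3) = (-5.0, 10.0)
```

Without this instruction, the alternating sign pattern needs extra shuffles and sign flips, which is why Intel singles out "complex arithmetic" as a target area.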

Is all that glitters gold? Or is there some copper? Oh, poor programmers! Let's hear what's said on the grapevine (a serious benchmarking site) [1]: "The anomaly here is Prescott. Prescott's 3D rendering performance trails pretty much everyone in LightWave and lags behind all the Intel processors in 3ds max. We find this a bit puzzling, as this type of rendering code isn't terribly branchy. The same holds true for the Cinebench test, whose workload is a bit more synthetic. In fact, Prescott would be dead last here, save for its Hyper-Threading capability, which allows it to post a dual CPU score higher than the Athlon 64 single CPU scores."

Perhaps optimizing these applications for SSE3 will offer some boost. That certainly occurred with the older Pentium 4's, which once performed relatively inefficiently with 3D content creation applications. As optimizations were added for the P4, rendering performance saw a substantial boost.
Mom, I wanna be a fireman. Or maybe a plumber, considering how expensive they are. Smile, folks, tomorrow it will be worse.

[Images: Northwood die and Prescott die]

90-nanometer Process Technology

Here the matter begins to get serious. A source [2] says: "Something rotten in Santa Clara. Despite what seems to be a largely improved processor, and one that should easily outperform a Northwood-core Pentium 4 at equivalent clock speed, this is not the case. Further, there are some strong indications that there is something very seriously wrong with Intel's 90nm process."

It refers to the fact that when Intel dropped from the 180 nm Willamette down to the 130 nm Northwood, there was at the beginning a gain in the order of 20%, which became a whopping 60% by the end of the process. Indeed we read in [2]: "The final Northwood at 3.40GHz is 70% faster than the fastest Willamette as a result of the success of the 130nm process. This time, on the other hand, the drop to 90nm seems not to be resulting in the usual improvements. So much so, in fact, that a rather last-minute change to the pipeline was necessary to produce decent yields at the promised speeds. The longer pipeline will lower Prescott's IPC, and largely offset any gains as a result of the improvements discussed."

However, the passage from 130 nm to 90 nm does help keep the dimensions small while increasing the number of transistors: we went from Northwood's 55 million transistors to Prescott's 125 million.
So if I had some small skylark eggs, now I have a place to fry them.

Seven Layers of low-k Copper Interconnect

Granny, can you bake a cake? Prescott has seven layers. The best definition I found is in [3]: "By the addition of a low-k insulator - carbon-doped oxide (CDO) if you care - between these copper interconnect layers, wire-to-wire capacitance is reduced and internal signal speeds are increased. This is particularly important as the manufacturing process shrinks because the whole circuitry becomes much denser and more tightly packed which in turn increases the risk of signal leakage and cross talk, definitely not a great idea in a mission critical server. By adding an extra layer (previous P4s had 6), Intel has been able to draw a compromise between die density and manufacturing costs."

What Happened to Pentium 5?

Well, poor Prescott: it's just been born, and already destined to die.
Intel is going to use it on the Canterwood chipset, and in the next quarter will move to Grantsdale and the LGA775 socket. A Celeron Prescott will be launched in Q2. Then Intel will move to Tejas, the new architecture (apparently due in Q2 2005).

Tejas, in [8]: "Successor to Prescott, likely to be called Pentium V, built on a 90 nm process, supporting IA-32e, DDR-II and PCI Express. Runs cooler and quieter than current chips. Uses a new LGA775 socket, with a 1066 MHz bus, and starts at around 4.5 GHz, moving to around 9 before it is canceled. To include the 8 new 'Tejas New Instructions', aka Azalia (Azalea), for improved audio multistreaming, speech recognition, Dolby Digital etc. Also includes 'Extended Enhanced HyperThreading'. Uses dual-channel DDR-II DRAM at 533 MHz and the Alderwood and Grantsdale chipsets. Will later be built in a 65 nm process, with a size of 80-100 mm². The 65 nm version will get 2 MB of L2 cache on chip. Earlier listed for a Q4 2004 launch, but recent news claims Prescott delays have moved the launch into 2005."
Poor Prescott, so young. So all this publicity looks like just a way to pop AMD's Athlon 64 party balloons.


Let the benchmarks talk. In [4]: "First, Prescott definitely does not represent monkey business, as it performs at the same level as the Northwood. In addition there is the unknown thing called SSE3, the use of which remains to be seen. That is a summary of Intel's Prescott processor in two sentences. However, is the Prescott a trendsetter?
In our opinion, the debut of a processor based on a cutting-edge process technology and architecture that only offers the same performance level as its predecessor is unanticipated. Despite 1 MB L2 cache and some optimizations, Prescott is slower than Northwood in roughly a third of our benchmarks. Software like many 3D shooters, and more serious applications like Lame, MS Movie Maker 2, Mathematica, Cinema 4D or even 3DStudio, perform worse than before.
On the other hand, a similar number of applications run faster on a Prescott CPU: DivX encoding with Xmpeg, file archiving with WinRAR, video authoring using Pinnacle Studio 9, and the remodeled SYSmark 2004."

While [5] says: "Surprisingly, Prescott disappointed us a bit in the content creation arena. While still holding its own in video processing chores, it lags a bit in audio. Most of all, its performance in 3D content creation seemed subpar. However, the real strength of Prescott seems to lie in its Hyper-Threading performance. In the majority of our multitasking tests, the Prescott performed as well as the higher-clocked 3.4GHz Northwood CPU."
That's it.

What to Buy?

There is never time to talk about everything. These are only the most evident characteristics of Prescott, but there is much more.
As for the introductory marketing, in [6] we read: "The Prescott Pentium 4s will be introduced at 2.8, 3.0, 3.2, and 3.4GHz speeds. The 2.8GHz version will be available in 133MHz (without Hyper-Threading support) and 200MHz Front-Side Bus versions, differentiated by the naming reference of either 2.8A or 2.8E GHz. All other CPUs in the range will have the letter E tacked on to the name, presumably to avoid confusion with the similar Northwood models."

But for what an average user is going to spend, a Prescott is good enough and an Athlon is better. A power user might consider that a dual system with two Athlon CPUs costs less than a single Pentium and performs better.

Around the Corner

What's on the horizon?
  • Bus: PCI Express - the new bus evolved from PCI, to be used both as a PCI replacement and for chip-to-chip communications, with 16 GB/s of bandwidth;
  • CPU: IBM PowerPC 970FX - the photonic chip: for its internal bus it will use an LED with a receiver on the opposite side;
  • VGA cards with 256 MB RAM and astonishing engines;
  • quarter-sized mini hard disks for digital cameras;
and so on and on.
Time is short, gotta go.


Vincenzo Maggio