April 8th, 2004
Introduction
The first mainframe I saw, in 1975, was an IBM 360, obviously with no
monitor: there was a little desk, about 80 cm wide (about 30"), with
a keyboard, on top of which two tractor feeds pulled out a sheet of
paper 132 columns wide. Instructions were written in Fortran on
cardboard cards, and once written, the whole deck of cards was fed in
through a punched-card reader. And while the computer needed two
small rooms, the storage units were in the hallway. Those were the
times when technicians wore a white uniform, like a doctor or a nurse.
Five years later, in 1980, I was hired as sysadmin on a midrange
system, a DEC PDP-11; I began serious programming, adventurously
obtaining a Fortran compiler and a linker. It was a small network with
three stations connected through serial ports. The 8” double
floppy-disk unit alone measured 70 cm x 90 cm (about 2 x 3 ft.) and
weighed about 25 kg (about 55 lbs.). Yes, you heard right: a double
floppy-disk drive weighed 25 kg.
The latest CPU in production today, codenamed Prescott, has 125
million transistors in a mid-range model. So we have arrived at this
Prescott. A real improvement? Well, as we’ll examine in the next
paragraphs, there are some major changes that would truly justify
a new name, so why does Intel keep the same name, Pentium 4? This too
we’ll try to discover.
Improvements
This time it looks like Intel really did a good job. There are several
improvements over the last Pentium 4, codenamed Northwood. The first
is a whole set of improvements, unified under the name Intel
NetBurst(TM) Microarchitecture, updated to yield higher frequency
and raw performance. This group includes:
- larger on-die caches;
- deeper pipeline;
- an improved Hyper-Threading Technology;
- better clock distribution;
- SSE3 - 13 new instructions;
plus some other minor enhancements like:
- 90-nanometer process technology;
- seven layers of low-k copper interconnect;
- a better branch prediction system.
Greater Levels of on-chip Cache
Never mind that many years ago DEC’s Alpha chip already had 2 MB of
on-board cache in its top model: Intel has added a 1 MB Level 2
Advanced Transfer Cache, versus the 512 KB of Northwood, which
transfers more data on each core clock and delivers a much higher
data throughput. This is very important in an HT Technology-enabled
processor like Prescott, where two threads run on one processor and
consequently need more cache space.
Then the Intel rocket scientists launched many studies on the internal
data flows and on which data the processor uses most, and
changed the algorithm that manages the cache. Previously we had a
queue placement managed by the so-called Least Recently Used (LRU)
policy; now, in the words of Anand Chandrasekher, an Intel Vice
President:
"We've also enhanced the data supply into the CPU. We've looked
at where does the CPU spend most of its time, where does the platform
spend most of its time in a notebook environment, again keeping in
mind how users use notebooks, and based on that information and all
of the studies, optimized the cache so that it's not only a larger
cache. It's also a very power aware cache. We've also modified the
system bus and added some new instructions or modified our pre-fetch
instructions to deliver better overall performance while delivering
power management at the same time".
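For readers who have never met the LRU policy mentioned above, here is a minimal sketch in Python. It is a textbook model, not Intel's cache logic: the class name `LRUCache`, the line tags and the capacities are all invented for illustration. It also hints at why two Hyper-Threading threads sharing one cache want more space, since two small interleaved loops can keep pushing each other's lines out.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with Least Recently Used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first, newest last

    def access(self, tag):
        """Touch a cache line; return True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False


def hit_rate(capacity, trace):
    """Fraction of accesses in `trace` that hit a cache of this size."""
    cache = LRUCache(capacity)
    hits = sum(cache.access(tag) for tag in trace)
    return hits / len(trace)


# Hypothetical workload: two threads, each looping over 3 lines, interleaved.
trace = []
for i in range(30):
    trace.append(("A", i % 3))
    trace.append(("B", i % 3))

small = hit_rate(4, trace)  # the 6 hot lines do not fit in 4 slots
large = hit_rate(8, trace)  # they do fit in 8 slots
print(f"4 lines: {small:.2f}   8 lines: {large:.2f}")
```

With four slots every access misses, because each line is evicted just before it is needed again; with eight slots only the first six accesses miss. Real caches are set associative and far larger, so take this only as the shape of the problem.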
Further improvements, according to Ron Smith, a Senior Vice
President, are: "In our Intel StrataFlash memory, we have
introduced something called a high speed burst interface specifically
for bursting information into the cache of the XScale, to keep that
cache filled, effectively giving you even higher MIPs by how you
balance the load with your memory".
Plus we have a larger L1 data cache: 16 KB vs. the 8 KB of the
previous Pentium 4. All this, plus other enhancements, sounds great,
doesn’t it? So why aren’t the benchmarks a lot better than those of
the previous Pentium 4? But for now let’s go on.
Deeper Pipelines
Intel's architects were not happy with simply shrinking the CPU and
adding more cache. Another way to scale performance is to increase the
clock frequency, and one way to reach a higher clock rate is to
increase the number of pipeline stages, since more stages mean less
circuit propagation delay per stage.
So the pipeline in Prescott has been extended from 20 stages to
the current 31. Big deal, huh?
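The link between stage count and clock rate can be put into numbers with a very rough model: the cycle time is the total logic delay divided by the number of stages, plus a fixed per-stage overhead for the latches. The delay figures below are assumptions picked to make the arithmetic readable, not Intel data, and `clock_ghz` is just an illustrative helper.

```python
def clock_ghz(stages, logic_delay_ns, latch_overhead_ns):
    """Clock rate if the logic is split evenly across `stages` stages."""
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    return 1.0 / cycle_ns


# Assumed values, chosen only for illustration.
LOGIC_DELAY_NS = 5.0      # total delay of the un-pipelined logic
LATCH_OVERHEAD_NS = 0.05  # fixed cost added by every stage

f20 = clock_ghz(20, LOGIC_DELAY_NS, LATCH_OVERHEAD_NS)
f31 = clock_ghz(31, LOGIC_DELAY_NS, LATCH_OVERHEAD_NS)
print(f"20 stages: {f20:.2f} GHz")
print(f"31 stages: {f31:.2f} GHz")
print(f"clock gain: {f31 / f20:.2f}x, stage count grew {31 / 20:.2f}x")
```

Under these assumptions the clock gain is smaller than the growth in stage count, because the fixed latch overhead does not shrink when the logic is sliced thinner.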
Well, but what if the instructions in the pipeline turn out not to be
needed? I mean that if a pipeline flush occurs, due to a poor branch
prediction, the penalty can be heavy. In Northwood it was a 20-cycle
performance penalty. So a good knowledge of what is going to fill the
pipeline is required. This leads us to the next section.
[Figure: Northwood vs. Prescott]
Branch Prediction
The improvements with respect to the Northwood/Willamette architecture
have been made basically in two areas:
- the Branch Target Buffer (BTB), which remembers where recently
executed branches jumped to
- the trace cache (the L1 cache containing the micro-ops)
The BTB has 2K entries, up from the previous 512, and the game is played
between Static Prediction and Dynamic Prediction. In the previous P4
the static algorithm assumed that every backward branch was part of a
loop, while in Prescott some logic has been added to work out whether
the branch is part of the current loop or belongs to another one. So if
a branch is not in the BTB and must be predicted statically, the
distance to its target is checked: if a preset threshold is
surpassed, the branch is no longer assumed to be taken.
On the other side, Dynamic Branch Prediction is improved by
adding an indirect branch predictor. According to Intel’s data, this
technique brings an improvement of 2 to 20%.
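The split between static and dynamic prediction can be sketched in a few lines of Python. Both halves are textbook simplifications and not Prescott's actual circuitry: the static rule treats a backward branch as a loop only when its target is close, with a threshold value that is purely hypothetical, and the dynamic half is the classic two-bit saturating counter.

```python
# Hypothetical threshold in bytes; the article gives no actual value.
LOOP_DISTANCE_THRESHOLD = 64


def static_predict(branch_addr, target_addr, threshold=LOOP_DISTANCE_THRESHOLD):
    """Static guess for a branch the BTB has never seen. True means taken."""
    if target_addr > branch_addr:
        return False  # forward branch: assume not taken
    # Backward branch: treat it as a loop only if the jump is short.
    return branch_addr - target_addr <= threshold


class TwoBitCounter:
    """Textbook dynamic predictor: states 0-1 say not taken, 2-3 say taken."""

    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes):
    """Run one counter over a sequence of branch outcomes."""
    counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch taken 9 times and then falling through, run 3 times over.
loop_outcomes = ([True] * 9 + [False]) * 3
print(static_predict(0x1000, 0x0FE0))  # short backward jump
print(static_predict(0x1000, 0x0200))  # long backward jump
print(count_mispredictions(loop_outcomes))
```

The counter mispredicts only at each loop exit, which is why loops are the easy case and everything else is where the predictor earns its keep.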
So branch prediction is a core issue for performance: if
the CPU knows, or at least guesses, what’s coming, it can fill its
pipeline more efficiently. But a longer pipeline leaves the
processor more exposed and vulnerable to mispredictions.
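How much more exposed? A common back-of-the-envelope model adds the flush stalls to a base cycles-per-instruction figure. The workload numbers below are assumptions for illustration only, and so is the idea that Prescott's penalty simply equals its 31 stages; only Northwood's 20-cycle penalty comes from the text above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction including pipeline-flush stalls."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Assumed workload figures, for illustration only.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20  # one instruction in five is a branch
MISPREDICT_RATE = 0.05  # 5% of branches are guessed wrong

northwood = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
# Assumption: the penalty grows with pipeline depth, so 31 cycles here.
prescott = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 31)

# Misprediction rate at which the deeper pipeline loses no more than the old one.
break_even_rate = MISPREDICT_RATE * 20 / 31

print(f"20-cycle penalty: CPI {northwood:.2f}")
print(f"31-cycle penalty: CPI {prescott:.2f}")
print(f"break-even misprediction rate: {break_even_rate:.3f}")
```

In this toy model the deeper pipeline has to cut mispredictions by roughly a third just to stand still, which is the reason the predictor had to improve along with the pipeline.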
[Figure: Prescott block diagram]
Improved Hyper-Threading Technology
Intel’s geniuses say they added specific features to Prescott to
enhance its HT performance. Let’s briefly review what HT means. You
understand that if your desktop has 2 CPUs, things happen
faster. That’s what HT does: it simulates 2 CPUs, splitting itself
into two logical halves that work on two things at the same time. So
you don’t need two CPUs, and don’t need to pawn or mortgage your
mother-in-law.
At Intel they are very proud of HT Technology. So do you
expect to go twice as fast as without HT? Wrong. It’s
not even close. However it helps; it’s better than getting a
ticket for jaywalking.
[Figure: Prescott die text]
Better Clock Distribution
Here I didn’t find much news. There is only a note on the Intel site
that declares: “Better clock distribution: up to four times
better than the previous desktop platform based on the Intel NetBurst
architecture. This in part helps Prescott scale to the 4-5GHz range”.
SSE3 - 13 New Instructions
Intel declares: “13 new processor instructions designed to
improve performance for special application areas such as media and
gaming. These instructions are grouped into five areas: floating point
to integer conversions, complex arithmetic, video encoding, SIMD
floating point using AOS format and thread synchronization”.
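To give a feel for the "complex arithmetic" and AOS (array of structures) items, here is a plain Python emulation of four of the new instructions, working on lists of four numbers instead of 128-bit registers. The lane behaviour of ADDSUBPS, HADDPS, MOVSLDUP and MOVSHDUP follows Intel's published definitions; the helper names `mulps`, `swap_pairs` and `complex_mul` are only illustrative, and real code would use assembly or compiler intrinsics.

```python
def movsldup(v):
    """MOVSLDUP: duplicate the even-indexed elements."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """MOVSHDUP: duplicate the odd-indexed elements."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """ADDSUBPS: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def haddps(a, b):
    """HADDPS: add adjacent pairs, first from a, then from b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def mulps(a, b):
    """Element-wise multiply (an SSE instruction that predates SSE3)."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap real and imaginary parts (a shuffle in real SSE code)."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul(x, y):
    """Multiply two pairs of complex numbers stored as [re0, im0, re1, im1]."""
    t1 = mulps(movsldup(x), y)              # [a*c, a*d, ...]
    t2 = mulps(movshdup(x), swap_pairs(y))  # [b*d, b*c, ...]
    return addsubps(t1, t2)                 # [a*c - b*d, a*d + b*c, ...]


# (1+2i)*(5+6i) and (3+4i)*(7+8i), stored interleaved as re, im.
print(complex_mul([1, 2, 3, 4], [5, 6, 7, 8]))
print(haddps([1, 2, 3, 4], [5, 6, 7, 8]))
```

The point of ADDSUBPS is the last step: without it, the mixed subtract-and-add of a complex product needs extra sign flips and shuffles when the data sits interleaved in memory.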
Is all that glitters gold? Or is there some copper? Oh, poor
programmers! Let’s hear what the grapevine says (a serious
benchmarking site) [1]: “The anomaly here is Prescott.
Prescott's 3D rendering performance trails pretty much everyone in
LightWave and lags behind all the Intel processors in 3ds max. We
find this a bit puzzling, as this type of rendering code isn't
terribly branchy. The same holds true for the Cinebench test, whose
workload is a bit more synthetic. In fact, Prescott would be dead
last here, save for its Hyper-Threading capability, which allows it
to post a dual CPU score higher than the Athlon 64 single CPU scores.
Perhaps optimizing these applications for SSE3 will offer some
boost. That certainly occurred with the older Pentium 4's, which once
performed relatively inefficiently with 3D content creation
applications. As optimizations were added for the P4, rendering
performance saw a substantial boost”.
Mom, I wanna be a fireman. Or maybe a plumber, considering how
expensive they are. Smile, folks: tomorrow will be worse.
[Figure: Northwood die]
[Figure: Prescott die]
90-nanometer Process Technology
Here things begin to get serious. A source [2] says: “Something
Rotten in Santa Clara. Despite what seems to be a largely
improved processor, and one that should easily outperform a
Northwood-core Pentium 4 at equivalent clock speed, this is not the
case. Further, there are some strong indications that there is
something very seriously wrong with Intel’s 90nm process”.
It refers to the fact that when Intel dropped from the 180 nm
Willamette down to the 130 nm Northwood there was at first a gain
on the order of 20%, which became a whopping 60% by the end of the
process's life. Indeed we read in [2]: “The final Northwood at 3.40GHz is
70% faster than the fastest Willamette as a result of the success of
the 130nm process. This time, on the other hand, the drop to 90nm
seems not to be resulting in the usual improvements. So much so, in
fact, that a rather last-minute change to the pipeline was necessary
to produce decent yields at the promised speeds. The longer pipeline
will lower Prescott’s IPC, and largely offset any gains as a result
of the improvements discussed”.
However, the move from 130 nm to 90 nm helps keep the die small
while increasing the number of transistors. So we went from
Northwood's 55 million transistors to Prescott's 125 million.
So if I had some small skylark eggs, now I have a place
to fry them.
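The transistor counts and process sizes above allow a quick sanity check. The sketch below assumes ideal scaling, where the area of each transistor shrinks with the square of the feature size; real dies do not scale this neatly (cache, for one, packs much more densely than logic), so read the result as a rough bound and not as a die-size measurement. The function name is just illustrative.

```python
def ideal_area_ratio(old_node_nm, new_node_nm, old_transistors, new_transistors):
    """Relative die area if every transistor shrank with the square of the node."""
    shrink = (new_node_nm / old_node_nm) ** 2
    growth = new_transistors / old_transistors
    return shrink * growth


shrink_only = (90 / 130) ** 2
ratio = ideal_area_ratio(130, 90, 55e6, 125e6)

print(f"area per transistor: {shrink_only:.2f} of the 130 nm value")
print(f"transistor count:    {125 / 55:.2f}x")
print(f"ideal die area:      {ratio:.2f}x Northwood")
```

Under that assumption each transistor takes a bit less than half the area, which is almost exactly what is needed to fit 2.27 times as many of them into a die of roughly the same size.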
Seven Layers of low-k Copper Interconnect
Granny, can you bake a cake? Prescott has seven layers. The best
explanation I found is in [3]: “By the addition of a low-k
insulator - carbon-doped oxide (CDO) if you care - between these
copper interconnect layers, wire-to-wire capacitance is reduced and
internal signal speeds are increased. This is particularly important
as the manufacturing process shrinks because the whole circuitry
becomes much denser and more tightly packed which in turn increases
the risk of signal leakage and cross talk, definitely not a great
idea in a mission critical server. By adding an extra layer (previous
P4s had 6), Intel has been able to draw a compromise between die
density and manufacturing costs”.
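Why a lower k helps can be seen from the parallel-plate formula C = k * e0 * A / d, used here as a crude stand-in for two neighbouring wires. Every number below is an assumption for illustration: the two k values are ballpark textbook figures for plain silicon dioxide and for a carbon-doped oxide, and the wire geometry is invented.

```python
EPSILON_0 = 8.854e-12  # vacuum permittivity, in F/m


def wire_capacitance(k, facing_area_m2, spacing_m):
    """Parallel-plate estimate of the capacitance between two wires."""
    return k * EPSILON_0 * facing_area_m2 / spacing_m


K_OLD = 3.9  # assumed: plain silicon dioxide
K_NEW = 3.0  # assumed: ballpark for a carbon-doped oxide

# Hypothetical geometry: 1 um of parallel run, 0.2 um tall, 0.15 um apart.
AREA = 1e-6 * 0.2e-6
SPACING = 0.15e-6

c_old = wire_capacitance(K_OLD, AREA, SPACING)
c_new = wire_capacitance(K_NEW, AREA, SPACING)
reduction = 1 - c_new / c_old

print(f"old insulator: {c_old:.3e} F")
print(f"low-k:         {c_new:.3e} F")
print(f"reduction:     {reduction:.0%}")
```

The geometry cancels out of the ratio, so the saving depends only on the two k values. Less capacitance means less charge to move on every transition, hence faster signals and less crosstalk between neighbours.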
What Happened to Pentium 5?
Well, poor Prescott: it’s just been born, and it's already destined
to die.
Intel is going to use it on the Canterwood chipset, and in the
next quarter will move to Grantsdale and to the LGA775 socket.
A Prescott-based Celeron will be launched in Q2. Then Intel will move
to Tejas, the new architecture (apparently due in Q2 2005).
Tejas in [8]: “Successor to Prescott likely to be called
Pentium V built on a 90 nm process, supporting IA-32e, DDR-II and
PCI Express. Runs cooler and quitter than current chips. Uses a new
LGA775 socket, with a 1066 MHz bus and starts at around 4.5 GHz,
moving to around 9 before it is canceled. To include the 8 new "Tejas
New Instructions" aka Azalia (Azalea), for improved audio
multistreaming, speech recognition, Dolby Digital etc. Also includes
"Extended Enhanced HyperThreading"… Uses dual channel DDR-II DRAM at
533 MHz and the Alderwood and Grantsdale chipsets. Will later be
built in a 65 nm process, with a size of 80-100 mm2. The 65 nm
version will get 2 MB of L2 cache on chip. Earlier listed for a
04-Q4 launch, but recent news claim Prescott delays have moved the
launch into 2005”.
Poor Prescott, so young. So all this publicity looks like it is just
meant to pop the AMD64 party balloons.
Conclusions
Let the benchmarks talk. In [4]: “First, Prescott definitely
does not represent monkey business, as it performs at the same level
as the Northwood. In addition there is the unknown thing called SSE3,
the use of which remains to be seen. That is a summary of Intel's
Prescott processor in two sentences. However, is the Prescott a
trendsetter?
In our opinion, the debut of a processor based on a cutting-edge
process technology and architecture that only offers the same
performance level as the predecessor is unanticipated. Despite 1 MB
L2 cache and some optimizations, Prescott is slower than Northwood
in roughly a third of our benchmarks. Software, like many 3D shooters
and more serious applications like Lame, MS Movie Maker 2,
Mathematica, Cinema 4D or even 3DStudio perform worse than before.
On the other hand, there are a similar amount of applications that
run faster on a Prescott CPU. These are DivX encoding with Xmpeg,
file archiving with WinRAR, video authoring using Pinnacle Studio 9
and the remolded SYSmark 2004”.
While [5] says: “Surprisingly, Prescott disappointed us a
bit in the content creation arena. While still holding its own in
video processing chores, it lags a bit in audio. Most of all, its
performance in 3D content creation seemed sub par.
However, the real strength of Prescott seems to lie in its
Hyper-Threading performance. In the majority of our multitasking
tests, the Prescott performed as well as the higher clocked 3.4GHz
Northwood CPU”.
That’s it.
What to Buy?
There is never time to talk about everything. These are only the
most evident characteristics of Prescott, but there is much more.
As for the introductory lineup, in [6] we read: “The Prescott
Pentium 4s will be introduced at 2.8, 3.0, 3.2, and 3.4GHz speeds,
respectively. The 2.8GHz version will be available in 133MHz (without
Hyper-Threading support) and 200MHz Front-Side Bus versions,
differentiated by the naming reference of either 2.8A or 2.8E GHz.
All other CPUs in the range will have the letter E tacked on to the
name, presumably to avoid confusion with the similar Northwood models”.
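What the two bus speeds mean in practice is easy to work out. The sketch relies on two facts the quote does not spell out: the Pentium 4 front-side bus moves data four times per clock ("quad-pumped") and is 64 bits, that is 8 bytes, wide. The function name is only illustrative.

```python
def fsb_bandwidth_mb_s(base_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Peak front-side bus bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return base_clock_mhz * transfers_per_clock * bus_width_bytes


slow_bus = fsb_bandwidth_mb_s(133)  # the 2.8A, marketed as a "533 MHz" bus
fast_bus = fsb_bandwidth_mb_s(200)  # the E models, marketed as "800 MHz"

print(f"133 MHz bus: {slow_bus} MB/s")
print(f"200 MHz bus: {fast_bus} MB/s")
```

These are peak figures; the base clock is nominally 133.33 MHz, so data sheets round the first value slightly upward.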
But for what an average user is going to spend, a Prescott is good
enough, and an Athlon is better. A power user, meanwhile, might
consider that a dual system with two Athlon CPUs costs less than a
single Pentium and performs better.
Around the Corner
What's on the horizon?
- Bus: PCI Express - the new bus evolved from PCI, which will be used
both as a PCI replacement and for chip-to-chip communications
- 16GB bandwidth;
- CPU: IBM PowerPC 970FX - the Photonic Chip; for its internal bus it
will use an LED with a receiver on the opposite side;
- VGA cards w/ 256 MB RAM and astonishing engines;
- quarter-sized mini HD for digicams;
and so on and on.
Time is short, gotta go.
2004.04.08
Vincenzo Maggio
References:
[1] www.extremetech.com/article2/0,1558,1478691,00.asp
[2] www.hardwareanalysis.com/action/printarticle/1686/
[3] www.trustedreviews.com/article.aspx?art=265&head=60
[4] www.tomshardware.com/cpu/20040201/prescott-30.html
[5] www.extremetech.com/article2/0,1558,1478697,00.asp
[6] www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD02OTYmdXJsX3BhZ2U9Mg==
[7] www.hardwareanalysis.com/action/printarticle/1686/
[8] endian.net/news_details.asp?LinkNo=7180
www.extremetech.com/article2/0,1558,1478685,00.asp
www.extremetech.com/article2/0,1558,1478694,00.asp
www.extremetech.com/article2/0,1558,1486581,00.asp
intel/pressroom
intel/labs
www.chip-architect.com/news/2003_03_06_Looking_at_Intels_Prescott.html
www.pcworld.com/news/article/0,aid,114820,00.asp
www.tomshardware.com/cpu/20040201/prescott-04.html
www.tomshardware.com/cpu/20040201/prescott-21.html