Intel Prescott Pentium 4 Processor

Review: Intel Prescott Pentium 4 Processor

Loyd Case

The German strategist Helmuth von Moltke once said, “No plan survives contact with the enemy.”

That adage is certainly true in the technology world. In Intel’s ideal world, the company would have had all the time it needed to tweak and perfect its 90nm process. Rumors over the last few months pointed to teething problems with the new process, including higher operating temperatures and power consumption than had been expected with Intel’s strained silicon process. The net result has been a somewhat restrained launch for Intel’s new progeny. Initial plans had called for launching the 3.4GHz CPU in quantity, but yields of 3.4GHz Prescotts have apparently been quite low. One of Intel’s biggest OEMs scaled back its system offerings to not include Prescott-based systems in one product category due to the lack of 3.4GHz Prescott availability.

On top of that, AMD has been on a roll. The recent release of the Athlon 64 3400+ proved to be a pleasant surprise, offering better performance gains than anticipated — a rarity these days. Sales of the new Athlon 64 line have propelled AMD to its first quarterly profit in over a year. Performance enthusiasts have been buzzing about AMD’s new flagship CPUs.

As we noted, Intel originally planned to launch its new 90nm Pentium 4 with a top clock rate of 3.4GHz. In fact, Intel may still paper launch at 3.4GHz, but only 3.2GHz and slower parts will be widely available. Supplies of the 3.4GHz Prescott will be “low,” and Intel will likely say so during its launch events.

To fill the gap, Intel is also launching a pair of new Pentium 4’s built around the older Northwood generation technology. One is a standard Northwood-based CPU, with 512KB of L2 cache, while the other will be an update to the Pentium 4 Extreme Edition (dubbed by some pundits as the “Emergency Edition”). Like the first P4EE, the new chip sports 512KB of L2 cache and 2MB of L3 cache. Both of the new/old CPUs will ship at 3.4GHz. All Prescott CPUs shipping on February 2nd will support Hyper-Threading and an 800MHz FSB (200MHz actual FSB clock, quad-pumped).
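The “800MHz” figure is an effective transfer rate rather than a clock rate. As a rough illustration, the sketch below works through the arithmetic. The 8-byte (64-bit) bus width is an assumption not stated in this article, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(clock_mhz: float, transfers_per_clock: int, bus_bytes: int) -> float:
    """Theoretical peak transfer rate in GB/s (decimal) for a parallel bus."""
    return clock_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9

# 200MHz actual clock, four transfers per clock (quad-pumped).
effective_mt_s = 200 * 4

# Assumption: the front-side bus is 8 bytes (64 bits) wide.
fsb_gb_s = peak_bandwidth_gb_s(200, 4, 8)

print(f"Effective rate: {effective_mt_s} MT/s, peak bandwidth: {fsb_gb_s:.1f} GB/s")
```

This is a theoretical ceiling only; sustained throughput in real applications will be lower.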

Intel supplied ExtremeTech with two processors: a 3.4GHz Pentium 4 Extreme Edition and a 3.2GHz Prescott CPU. Given that, it would be interesting to compare the performance of a 3.4GHz Northwood to Intel’s new baby. We sorely wanted to do this, but Intel was understandably reticent to hand out old-generation CPUs that might “distract” from the launch of their new architecture. However, we were able to obtain a 3.4GHz Northwood from another source, so we have performance data for a nearly complete suite of new CPUs to present — only the rare 3.4GHz Prescott is missing from the mix.

Before we get to the performance tests, though, let’s take a stroll through Prescott’s internal architecture. The new CPU is more than a die shrink, adding some significant architectural enhancements.

Prescott is more than just a die shrink. When Intel moved to the 130nm process with Northwood, designers added an additional 256KB of L2 cache and fine-tuned a dormant feature present in all P4’s since the original “Willamette” design: the ability to perform simultaneous multithreading, which Intel dubs Hyper-Threading. Initially, Hyper-Threading was disabled in Northwood chips, but Intel turned it on when it launched the 3.06GHz version.

Moving to 90nm, Intel has once again enhanced the microarchitecture. Prescott has a number of tweaks, some simply to take advantage of the new process, while others are actually changes to the internal architecture. The most obvious change, brought about by the reduced die size available at 90nm, is the additional cache. Both the L1 and L2 cache sizes have doubled. The L1 data cache is now 16KB, while the L2 unified (data and instruction) cache size is now 1MB. Let’s take a quick look at how the various Intel Pentium 4 CPUs compare, and toss in an Athlon 64 for comparison.

| Current Feature Set (2/2/04) | Willamette | Northwood | P4EE | Prescott | Athlon64 (FX-51 & 3400+) |
|---|---|---|---|---|---|
| Process | 180nm | 130nm | 130nm | 90nm | 130nm |
| Transistor Count (Million) | 42 | 55 | 178 | 125 | 106 |
| Die Size (mm²) | 170 | 131 | 237 | 112 | 193 |
| L1 Cache (KB) | 8 | 8 | 8 | 16 | 128 |
| L2 Cache (KB) | 256 | 512 | 512 | 1024 | 1024 |
| L3 Cache (KB) | NA | NA | 2048 | NA | NA |
| Max Frequency 2/2/04 | 2GHz | 3.4GHz | 3.4GHz | 3.2GHz (3.4GHz soon) | 2.2GHz |

The die shrink enables Intel to build a processor with double the cache of Northwood, but with a smaller die size. Intel has also aggressively moved to 300mm wafers, so the net result is a much lower cost per CPU manufactured. Of course, the cost of transitioning to new fabrication technologies still has to be amortized, but the long term result is lower costs, and eventually, lower prices.
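To get a feel for why a smaller die on a larger wafer lowers cost, here is a minimal sketch using a common textbook approximation for gross dies per wafer. It ignores yield, scribe lines and edge exclusion, so the numbers are only indicative. The die areas come from the table above; pairing Northwood with a 200mm wafer is an assumption made purely for comparison, since the article does not say which wafer size Northwood used.

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Rough estimate: wafer area divided by die area, minus a term for
    partial dies lost around the wafer edge. Yield is not modeled."""
    radius = wafer_diameter_mm / 2
    wafer_area = math.pi * radius ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

# Die areas (mm^2) taken from the feature table.
prescott_300mm = gross_dies_per_wafer(300, 112)
northwood_300mm = gross_dies_per_wafer(300, 131)
# Assumption for illustration only: Northwood on an older 200mm wafer.
northwood_200mm = gross_dies_per_wafer(200, 131)

print(prescott_300mm, northwood_300mm, northwood_200mm)
```

Even before yield is considered, the estimate shows the larger wafer contributing far more to the candidate die count than the die shrink alone.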

We covered the original Pentium 4 architecture extensively two years ago, so our discussion here focuses on changes to the architecture inherent in Prescott.

Intel’s CPU architects weren’t content with simply shrinking the CPU and adding more cache. The underlying philosophy of the Pentium 4 is to scale performance by increasing the clock frequency. One method for enabling higher clock rates is to increase the number of pipeline stages (more stages yield less circuit propagation delay per stage, permitting higher clock rates). A deeply pipelined architecture needs to have fairly good knowledge of what instructions are likely to enter the pipeline in the near future. Further, most software these days isn’t just linear streams of code, but often loops and branches, as needed by the application.

The ability to predict when code will branch, and hence know what code will enter the pipeline, is known as branch prediction. A deeply pipelined architecture needs to have highly accurate branch prediction. If the pipeline is filled with incorrect instructions and has to be flushed and reloaded with proper instructions due to an unpredicted code branch, the performance penalty can be pretty stiff. For example, a pipeline flush in Northwood results in a 20 cycle performance penalty.

The pipeline in Prescott has been extended to 31 stages, so a pipeline flush due to poor branch prediction can result in a much larger clock cycle penalty every time a branch misprediction occurs. Therefore Intel’s architects worked to improve Prescott’s branch prediction over that of Northwood.
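The trade-off can be sketched with a simple cycles-per-instruction model. Only the 20-cycle Northwood penalty comes from the text above. The 31-cycle Prescott penalty (one cycle per pipeline stage), the base CPI, the branch frequency and the misprediction rate are all illustrative assumptions, not measured values.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, flush_penalty: float) -> float:
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

BASE_CPI = 1.0          # assumed
BRANCH_FRACTION = 0.20  # assumed: one instruction in five is a branch
MISPREDICT_RATE = 0.05  # assumed: 5% of branches are mispredicted

northwood_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
# Assumption: Prescott's penalty scales with its 31-stage pipeline.
prescott_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 31)

# Misprediction rate Prescott would need to match Northwood's CPI.
break_even_rate = MISPREDICT_RATE * 20 / 31

print(f"Northwood CPI {northwood_cpi:.2f}, Prescott CPI {prescott_cpi:.2f}")
print(f"Break-even misprediction rate: {break_even_rate:.2%}")
```

Under these assumptions, the predictor has to eliminate roughly a third of its mispredictions just to hold CPI level, which is why the deeper pipeline and the improved predictor go hand in hand.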

Prescott has a few areas of enhancement in branch prediction functionality. Before we get into details, understand that all P4 processors actually have two areas where branch predictions are performed — in the front end of the pipeline, where x86 instruction streams are loaded, and at the trace cache (L1 instruction cache containing micro-ops). Most instruction sequences are retrieved from the trace cache during normal program execution. The pipeline depth (20 or 31 stages mentioned prior) is measured from the point of obtaining the trace cache instruction pointer from the Branch Target Buffer (BTB) associated with the trace cache. You can see this BTB in the block diagram has 2K entries (up from 512 entries in older P4s) versus 4K entries for the front-end BTB (same as older P4s).

Static branch prediction (a technique that relies on prior knowledge of branch behavior before actual program execution, such as knowing most loops branch backwards) was improved in Prescott. In all P4’s, static branch prediction will occur at decode time if the Branch Target Buffer (BTB) has no dynamic branch prediction data for a particular branch. In prior P4 static branch prediction algorithms, backwards branches were assumed to be part of loops, but that’s not always the case. Prescott adds logic to help determine if a backward branch was part of a loop or another type of backwards branch. Loop branches tend to have shorter jumps than other types of backward branches. If the branch was not included in the BTB and must be statically predicted, a check is made on both branch direction and branch distance. If a predetermined threshold for branch distance (seen in typical loops) is exceeded, the branch is predicted to be not taken. In other cases, it was determined that certain conditions would typically result in not taken branch behavior, regardless of distance and direction.
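The direction-plus-distance check described above might look something like the following sketch. The threshold value is hypothetical (Intel’s actual figure is not given here), the forward-branch rule is the conventional static assumption rather than something stated in this article, and the additional condition-based rules are left out.

```python
# Hypothetical threshold in bytes; the real value is not stated in the article.
LOOP_DISTANCE_THRESHOLD = 256

def predict_taken_static(branch_addr: int, target_addr: int,
                         distance_aware: bool = True,
                         threshold: int = LOOP_DISTANCE_THRESHOLD) -> bool:
    """Static prediction for a branch with no BTB entry.

    distance_aware=False models the older rule (every backward branch is
    assumed to be a loop); distance_aware=True adds the distance check.
    """
    if target_addr >= branch_addr:
        # Forward branch: conventional static rule, predict not taken.
        return False
    if not distance_aware:
        # Older behavior: backward means loop, so predict taken.
        return True
    # Newer behavior: only short backward jumps look like loops.
    return (branch_addr - target_addr) <= threshold

# A tight loop: jumps back 32 bytes.
print(predict_taken_static(0x1020, 0x1000))
# A long backward jump, treated differently by the two schemes.
print(predict_taken_static(0x9000, 0x1000, distance_aware=False))
print(predict_taken_static(0x9000, 0x1000, distance_aware=True))
```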

Dynamic branch prediction accuracy is enhanced by adding an indirect branch predictor. Interestingly, this is similar to a technique used in the Pentium-M (Banias) processor. Intel’s trace data revealed that the new techniques improved branch prediction in a number of SPEC benchmarks by 2 to 20%.

The Prescott architecture team incorporated additional tweaks to the new Pentium 4 microarchitecture. The L1 data cache associativity was increased from 4-way to 8-way when the size doubled (8K in Northwood to 16K in Prescott). The new 1MB unified, write-back L2 cache is still 8-way set associative, as in past P4s, and still has 128 bytes/cache line.
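Size, associativity and line size together determine how many sets a cache has. The sketch below works that out for the figures quoted above. The L2 numbers (8-way, 128 bytes per line) are from the text, while the 64-byte L1 line size is an assumption, since the article does not give it.

```python
KB = 1024

def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size must be a multiple of ways * line size")
    return sets

# L2: figures from the article.
prescott_l2_sets = cache_sets(1024 * KB, ways=8, line_bytes=128)
northwood_l2_sets = cache_sets(512 * KB, ways=8, line_bytes=128)

# L1 data cache: the 64-byte line size is an assumption.
prescott_l1_sets = cache_sets(16 * KB, ways=8, line_bytes=64)
northwood_l1_sets = cache_sets(8 * KB, ways=4, line_bytes=64)

print(prescott_l2_sets, northwood_l2_sets, prescott_l1_sets, northwood_l1_sets)
```

If the line-size assumption holds, doubling both the L1 size and its associativity leaves the number of sets unchanged, so the extra capacity shows up as more ways per set rather than more sets.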

The sizes of the instruction schedulers for x87 and all levels of SSE instructions were increased to improve the ability to find parallelism in multimedia code, as was the effective size of the queues that feed all the schedulers, not just a subset. Increasing scheduler queue size reduces allocator stalls, permitting the allocator logic to continue assigning micro-ops to individual functional unit scheduler queues that follow in the pipeline, while also processing machine resource requests from new micro-ops entering the allocator stage.

A dedicated integer multiplier has been added. Previously, the floating point multiplier had been used for integer multiplies, but that increased latency by moving operands to the FP unit and routing the result back to the integer unit.

More types of micro-ops can now be encoded inside the trace cache than in prior P4s, rather than being sequenced by the Microcode ROM (a slow process for complex and/or infrequently used instructions). Two common instruction types that can now be encoded and stored in the trace cache are indirect calls with a register source operand, and software prefetch instructions.

Additional processor resources were incorporated, including the ability to have 32 stores outstanding (versus 24 in past P4s) and increasing the number of write combine buffers to eight (from six). The processor also keeps track of eight loads that missed the L1 data cache; previously, only four missed loads were tracked. Some changes were made to the hardware prefetch mechanism to increase its efficiency, in addition to software prefetches now being stored in the trace cache.

Shift and rotate instructions can now be executed quickly by a new shifter/rotator logic block included in one of the two fast ALUs. In prior P4s, such operations were complex and took many cycles.

Sequencing of load and store micro-ops (instructions) was reworked in Prescott to avoid latency and load re-execution. Re-execution occurred when store data had to be forwarded to a load instruction (prior to being stored in the L1 data cache), yet the load micro-op executed before the store micro-op. Prescott adds a predictor that indicates when a load is likely to need data forwarded from a particular store micro-op, so the load scheduler can hold the load until that specific store is scheduled.

Most of these seem like relatively minor increases in efficiency, but they all serve to also improve Hyper-Threading performance. In fact, some of these changes may have little effect if only a single thread is running, but affect performance in a multithreaded environment. Some additional resources were added to specifically improve performance in a threaded environment, such as the ability to simultaneously access the memory page table while handling a memory access that splits a cache line.

SSE3 Instructions

The new 90nm Pentium 4 adds 13 new SSE instructions, aka “Prescott New Instructions.” These include:

An instruction to speed up x87 floating point to integer conversion

Five instructions to improve the efficiency of loading, moving and duplicating SIMD data, useful in complex arithmetic algorithms

An instruction to avoid cache line splits when loading data, useful in certain video compression applications

Four instructions to enable more efficient handling of arrays of structures. This is useful in 3D graphics, particularly when processing vertex buffers.

Two instructions that help manage thread synchronization, which will in turn improve Hyper-Threading performance.

Like past additions to the SSE instruction set, applications will need to be recompiled (and in some cases, hand-tuned) to take advantage of the new instructions. Since Prescott has been sampling for a number of months now, the wait for some key applications may not be too long. More detailed information on SSE3 is available on Intel’s developer site.

We built several testbeds to run benchmarks, striving to keep them as similar as possible. Let’s look at the configurations.

| Component | All P4 Systems | Athlon 64 FX-51 System | Athlon 64 3400+ System |
|---|---|---|---|
| Processor | 3.2GHz Northwood, 3.2GHz Prescott, 3.2GHz P4EE, 3.4GHz Northwood, 3.4GHz P4EE | Athlon 64 FX-51 at 2.2GHz (socket 940) | Athlon 64 3400+ at 2.2GHz (socket 754) |
| Motherboard | Asus P4C800-E, Intel 875P chipset, 1014 BIOS | Asus SK8N, Nforce3 Pro 150 chipset, 1004 BIOS | Asus K8V Deluxe, Via K8T800 chipset, 1004 BIOS |
| Memory | 2 x 512MB (1GB) Kingston HyperX PC3200 unbuffered, CAS2-3-3-6 | 2 x 512MB (1GB) Mushkin PC3200 Registered, CAS 2-2-2-6 | 2 x 512MB (1GB) Kingston HyperX PC3200 unbuffered, CAS2.5-3-3-6 |
| Graphics | Asus Radeon 9800XT, Catalyst 4.1 drivers | Asus Radeon 9800XT, Catalyst 4.1 drivers | Asus Radeon 9800XT, Catalyst 4.1 drivers |
| Hard Drives | 2 x WD360 10,000RPM SATA drives configured as a RAID 0 striped array, 128K block size | 2 x WD360 10,000RPM SATA drives configured as a RAID 0 striped array, 128K block size | 2 x WD360 10,000RPM SATA drives configured as a RAID 0 striped array, 128K block size |
| Optical | Toshiba DVD-ROM | Toshiba DVD-ROM | Toshiba DVD-ROM |
| Audio | Creative Labs Audigy 2 | Creative Labs Audigy 2 | Creative Labs Audigy 2 |
| Networking | Intel Pro1000 CSA | Nforce3 10/100 Ethernet | 3Com 3C940 Gigabit Ethernet |
| Chassis | Antec SX-830 | Antec SX-830 | Antec SX-830 |
| Power Supply | Vantec Stealth 430W | Vantec Stealth 430W | Vantec Stealth 430W |
| Operating System | Windows XP SP1, all current patches installed, DirectX 9 and Windows Media Player 9 installed | Windows XP SP1, all current patches installed, DirectX 9 and Windows Media Player 9 installed | Windows XP SP1, all current patches installed, DirectX 9 and Windows Media Player 9 installed |

Each system was initially configured with a clean install of Windows XP (Service Pack 1). Then all current critical patches were downloaded and installed from the Windows Update site. We also installed DirectX 9.0b and Windows Media Player 9. Virtual memory was set to a fixed 2048MB swap file.

We had initially tried to use the latest spin (3.1) of Intel’s D875PBZ. The Intel motherboard group worked to improve memory latency and throughput on the new spin. However, the beta BIOS that had been supplied choked when trying to configure our Raptor SATA drives in a RAID 0 configuration, so we switched to the Asus boards. Intel has since uncovered a BIOS issue with the beta BIOS. The official BIOS release that both supports Prescott and fixes the RAID issue we uncovered should be posted on Intel’s site soon.

Note that the memory configurations varied slightly. Both the Athlon 64 3400+ system and the Intel processor testbeds used 2 x 512MB PC3200 DIMMs, for a total of 1GB. The FX-51 testbed was configured with a pair of Mushkin PC3200 registered DIMMs, as recommended by AMD. We ran the systems with the best possible stable memory timings. The P4 testbeds required latency settings of 2-3-3-6 to remain stable in all benchmarks. The Asus K8V Deluxe refused to POST at any setting other than the default SPD setting, which is 2.5-3-3-7. As we’ll see shortly, that proved to be a non-issue. The Asus SK8N was stable with timings of 2-2-2-6, but this didn’t seem to overcome the handicap of having to use registered (buffered) modules.

The hard drives were defragged before any major test requiring significant hard drive access. Vsync was disabled for all real-time graphics tests. We executed the following command before any test cycle: rundll32.exe advapi32.dll,ProcessIdleTasks. This completes any background idle tasks, and improves benchmark score reproducibility.
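For anyone scripting a similar benchmark run, the idle-task command can be wrapped in a small helper such as the sketch below. The helper name and its parameters are hypothetical, and this is not the harness we used. Only the command itself comes from the text above. The helper does nothing on non-Windows systems.

```python
import subprocess
import sys

# The command quoted in the article, split into program and argument.
IDLE_TASKS_CMD = ["rundll32.exe", "advapi32.dll,ProcessIdleTasks"]

def flush_idle_tasks(run=subprocess.run, platform=sys.platform):
    """Ask Windows to process pending idle tasks before a benchmark pass.

    Returns None without running anything on non-Windows platforms.
    The `run` and `platform` parameters exist so the helper can be
    exercised without actually invoking the command.
    """
    if not platform.startswith("win"):
        return None
    # check=True raises CalledProcessError if the command exits non-zero.
    return run(IDLE_TASKS_CMD, check=True)

if __name__ == "__main__":
    flush_idle_tasks()
```

The command can take several minutes to return on a system with a lot of pending work, so a harness would typically call it once per test cycle rather than before every individual run.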

Our benchmark suite has evolved over time. However, our suite covers a wide range of applications which are significantly affected by CPU performance. The suite consists of a mix of synthetic and actual applications, but is heavily weighted towards real applications. Here are the tests we ran for this processor preview:

Business Winstone 2004

Business Winstone 2004 is the latest version of Veritest’s Winstone benchmark suite. It consists of a variety of common desktop applications, run in a scripted sequence that resembles actual usage patterns. Most of these are Microsoft Office applications, including Microsoft Project and Access. Also included are Norton Antivirus Professional 2003 and WinZip 8.1.

New this year is a second set of four inspection tests designed to yield information on multitasking performance. The first runs Outlook and Internet Explorer in the foreground while performing a file copy in the background. The second runs Excel and Word operations while WinZip runs in the background, archiving files. The final test runs a Norton Antivirus scan in the background while Excel, PowerPoint, Project, Access, FrontPage and WinZip perform foreground chores.

Multimedia Content Creation Winstone 2004

The latest release of Content Creation Winstone updates most of the applications to recent versions. It also shifts away from Windows Media Encoder 7.1 to the current Windows Media Encoder 9. Sound Forge has been replaced with Steinberg’s WaveLab. One note: LightWave is currently running as a single threaded application.

Both Business Winstone and Multimedia Content Creation Winstone 2004 can be ordered from Veritest and delivered on CD-ROM for a nominal shipping charge. They cannot be downloaded.

Adobe After Effects 6.0 Professional

This is an updated version of our earlier After Effects 5.5 test, using the newer version from Adobe. After Effects is a professional video compositing and editing tool. This test runs a scripted set of typical After Effects compositing and filter operations and generates a log file with elapsed time data at the end.

DiVX 5.1.1 Encode

We use the freeware VirtualDub and the latest DiVX 5.1.1 codec to compress a very high bitrate 330MB AVI file to about 80MB. The file was extracted from the DVD of The Rock and originally encoded at full resolution with Indeo 5.1. It offers both rapid action and high contrast scenes, making it a challenging clip for any compression scheme. The same file is used in our other video compression tests.

WME 9 Test

We use Windows Media Encoder 9 to encode the above video to a 60MB WMV file. Audio is compressed to 70kbps, and the total bitstream is encoded at roughly 2050 kbps.

Quicktime 6.5 / Sorenson Test

We use QuickTime Professional 6.5 and the highly regarded Sorenson 3 codec to compress our 330MB AVI file to about 75MB, using its highest quality settings.

MusicMatch 8.2 MP3Pro Encode

We use the latest version of MusicMatch to encode a 248MB .WAV file to an 11.8MB MP3Pro file at 64 kbps and note the time.

WMA 9 Encode

We use Windows Media Encoder to compress a 248MB WAV file to 11.3MB at a 70kbps data rate and record the time.

Cinebench 2003

We run Cinebench 2003 to test the software 3D rendering performance using Maxon’s Cinema4D engine. Cinebench also allows us to see how performance varies with multiple CPUs, virtual or real.

LightWave 7.5

NewTek’s LightWave is a highly popular 3D modeling and rendering application used extensively in Hollywood and elsewhere. We run three different renders from LightWave’s benchmark folder to hammer on the CPU. All LightWave renders take place with two threads enabled.

Discrete 3ds max 5.1

Another popular professional 3D modeling app, 3ds max is multithreaded. We run a variety of rendering tests, and report several results. We rendered five consecutive frames and recorded the rendering time.

PC Mark 2004

The latest iteration of FutureMark’s suite of synthetic tests has expanded on the limited repertoire of the original. FutureMark has added several multithreaded tests, as well as expanded to include storage and graphics. We focus on the memory and CPU tests here.

3D Mark 2003

The latest version of 3D Mark has had its share of controversy. However, it’s useful for gauging how a processor might fare in real-time 3D applications.

3D Gaming Tests

Perhaps no application exercises the system more than current generation 3D games. We use the following games to test the performance of these processors. Note that all results are reported at low resolutions and, in most cases, low detail. While you’d never play a game at these resolutions, running that way serves to isolate CPU performance and negate any potential impact of the graphics card. The games we use include:

Halo for the PC

Dungeon Siege

Flight Simulator 2004

Comanche 4

Serious Sam SE

Unreal Tournament 2003

Splinter Cell

Multitasking Tests

One of Intel’s key value propositions for its new generation of Pentium 4 CPUs is simultaneous multithreading, or what Intel calls Hyper-Threading. We wanted to examine multitasking performance carefully, so we looked at the results of several tests:

Business Winstone multitasking tests

A custom scenario, involving Norton Antivirus and Photoshop Elements 2.0 running simultaneously

PC Mark 2004 multithreading tests

A custom scenario where we run Flight Simulator 2004 and Windows Media Encoder together. However, rather than report a single result, we look at the frame rate over time of FS2004.

With these tests in mind, let’s look at actual performance data.

Given Prescott’s architectural tweaks, we’ll try to analyze the performance results in terms of those changes. Since we have three different Pentium 4 variants, all running at 3.2GHz, comparisons of performance are possible, but we’ll be cautious in our assessment. For one thing, none of these applications have been enhanced for Prescott’s new SSE3 instructions, and some would clearly benefit. But if you’re buying a system today, how today’s applications perform on a new architecture is certainly a valid concern.

Business Winstone 2004 / Multimedia Content Creation Winstone 2004

The Athlon 64 3400+ puts a hurt on all the other processors in the Business Winstone test. At first blush, the difference in scores between the 3400+ and the second place Athlon 64 FX-51 seems a bit mysterious, but it’s probably due to the use of unbuffered memory, which seems to overcome the memory bandwidth deficiency. Prescott trails the rest of Intel’s offerings in this test, though the gap isn’t all that large.

What’s really impressive is that AMD’s 3400+ also cleans up in Multimedia Content Creation Winstone. However, part of this may be due to the fact that multithreading isn’t enabled in CC Winstone’s LightWave test. Still, it’s an impressive showing. The P4EE 3.4GHz places second, ahead of the FX-51.

In both tests, Prescott is essentially in a statistical dead heat with the 3.2GHz Northwood part: slightly behind in Business Winstone and slightly ahead in CC Winstone. Given the branchy nature of business applications, Prescott’s performance in Business Winstone is surprisingly good. Clearly the larger cache, improved branch prediction and better memory handling offset the deeper pipeline.

Video processing tasks, including applying filters, video compression, and transcoding, are increasingly important applications in today’s media-rich computing environment. Note that Windows Media Encoder 9 was set to dual-pass mode.

In general, Intel’s CPUs do well in media encoding. AMD’s processors have tended to trail in this type of application, despite now having SSE2 instructions built in. However, the Athlon 64 3400+ again proves its mettle in a couple of tests, outpacing Intel’s CPUs in the Quicktime/Sorenson compression test and the DiVX 5.1.1 test. However, Intel’s processors still hold sway in WMV9 transcoding and Adobe After Effects processing.

Prescott acquits itself pretty well here in video encoding and transcoding, essentially tying or leading the 3.2GHz Northwood. It even holds its own against the 3.4GHz Northwood in several tests. Audio compression seems to be a different story, as the new Intel CPU lags a bit behind the Northwood. AMD’s 64-bit CPUs do well here, outpacing the 3.2GHz Prescott, but lagging behind the other Intel CPUs.

3ds max, LightWave and Maxon’s Cinebench 4D are professional modeling and animation tools. Here, we test rendering performance, enabling multithreading where needed (3ds max is multithreaded by default).

Performance in these applications is heavily dependent on floating point performance. All of them have been optimized for SSE2, but it’s also clear that the clock rate has some impact. Once SSE2 enters the mix, the Pentium 4 line does very well. Here, the champ is the 3.4GHz Pentium 4 Extreme Edition.

The anomaly here is Prescott. Prescott’s 3D rendering performance trails pretty much everyone in LightWave and lags behind all the Intel processors in 3ds max. We find this a bit puzzling, as this type of rendering code isn’t terribly branchy. The same holds true for the Cinebench test, whose workload is a bit more synthetic. In fact, Prescott would be dead last here, save for its Hyper-Threading capability, which allows it to post a dual CPU score higher than the Athlon 64 single CPU scores.

Perhaps optimizing these applications for SSE3 will offer some boost. That certainly occurred with the older Pentium 4’s, which once performed relatively inefficiently with 3D content creation applications. As optimizations were added for the P4, rendering performance saw a substantial boost.

These are synthetic tests, but can reveal the behavior of key subsystems.

The P4 has classically garnered good results in PCMark 2004, so its scores are no real surprise. What is interesting is how well Prescott performs here. In the CPU test, it ties the 3.2GHz Northwood and 3.2GHz P4EE, and places second in the memory test. The PCMark 2004 CPU test has cache locality properties that do not derive any tangible benefit beyond a 512K L2 cache. All performance gains shown across the P4 chips relate to clock speed, not cache size differences. We’ll take a closer look at the memory results in a bit.

When using software vertex shaders, the Athlon 64 3400+ outpaces Prescott, but not by a wide margin. As we’ll see shortly, this result is a leading indicator for game performance. If we look more closely at the 3DMark CPU test, we see the Athlon 64 3400+ performing exceptionally well in CPU test 1, which is more of a DX7-style rendering engine. In the more vertex-shader intensive test, the P4’s perform fairly well, though Prescott does lag a bit.

Our 3D game tests offer a mix of CPU-intensive and memory bandwidth hungry tests. We keep the resolution low, so that the graphics card doesn’t unduly affect the CPU impact.

If you’re primarily a gamer, the CPU of choice here is the Athlon 64 3400+. When mated to the Asus K8V Deluxe motherboard, the 3400+ outpaced all the other processors in most tests, with the exception of Comanche 4 and Serious Sam SE. The 3.4GHz Pentium 4 Extreme Edition took the honors in those titles.

Note that frame rates at playable resolutions — 1024×768 or higher — tend to be much more evenly matched, due to the influence of a fast graphics card. But even in those cases, the Athlon 64 3400+ tends to be a bit faster than the pack.

The 3.2GHz Prescott performance is a mixed bag. The new Intel processor actually does pretty well in several tests, edging out the 3.4GHz Northwood in Dungeon Siege and Flight Sim 2004. It only loses to the Northwood 3.2GHz part in one test, Comanche 4. So despite the added pipeline stages, Prescott turns out to be a decent performer in games, albeit overshadowed by AMD. We’re certainly looking forward to checking out the 3.4GHz Prescott when that becomes available.

Intel has suggested that multithreaded performance should improve with Prescott, since specific features were added to the processor to enhance its Hyper-Threading capability. Let’s look at multithreaded performance in a couple of different ways. First up are PCMark 2004’s system multithreading tests.

As we can see in this chart, Prescott does pretty well, essentially staying even with the higher clocked Northwood 3.4GHz CPU. In fact, on the first PCMark test, which simultaneously runs file encryption and file compression algorithms, it gets the highest score of any CPU. This behavior, plus the result of the memory test, explains Prescott’s solid PCMark 2004 score.

Next, let’s look at one of our older multitasking tests, involving Norton Antivirus 2003 and Adobe Photoshop Elements 2.0.

We’ve used this test in the past to gauge multitasking performance. Photoshop Elements 2.0 runs a scripted set of filters on a photograph while Norton Antivirus 2003 runs a virus scan on a fixed directory of files and folders in the background. Prescott outperforms both Northwood processors, although it falls behind the Extreme Edition CPUs. This suggests that cache size may play a role in these tests. But given the deeper pipeline of Prescott, it’s an intriguing showing.

The third multitasking scenario is one that really hammers the system. We run two highly CPU intensive applications, Windows Media Encoder 9 and Microsoft Flight Simulator 2004.

This messy-looking chart shows the frame rate of Flight Simulator 2004 over a 140 second run. Each data point represents the momentary frame rate at roughly 1/2 second intervals. During the first run (labeled “solo”), Flight Simulator 2004 had the entire system to itself. In the second run (labeled “multi”), Windows Media Encoder 9 was busily compressing a 330MB AVI file to an 84MB WMV file while FS2004 was running its test.

Note also that we only ran the Prescott and P4EE 3.2GHz Intel CPUs. We could have added the results of all the CPUs, but the chart would then be nearly impossible to decipher. We also ran the test with the Athlon 64 3400+, which garnered the best overall Flight Simulator 2004 benchmark. The resolution here is a bit higher than our earlier game test, and the graphics features turned up a bit, too, to represent a more playable game experience.

Several interesting things about this chart jump out after a little study. First, the top and the bottom data lines are both from runs with the Athlon 64. When WME9 was running, the Athlon 64 averaged less than 4 frames per second. We did see one large spike in frame rate, but the curve pretty much remained under 4 fps for the majority of the run. Both Pentium 4 processors performed more poorly than the Athlon 64 when running Flight Sim 2004 solo, but managed to average around 17 frames per second while WME9 was chugging along in the background. The other interesting data point is that Prescott’s average frame rate of 17.2 fps when multitasking was essentially the same as the 3.2GHz P4EE’s 17.5 fps. Of course, the frame rate dipped on occasion, but the point here is that Hyper-Threading clearly has a major impact.

Finally, let’s take a quick look at Prescott’s memory performance using PCMark 2004’s memory tests.

The first two charts deal with raw block reads and writes to memory. The 192KB raw block tests fit into L2 cache, but not L1 cache. Since the data resides in the cache, the on-die memory controller of the AMD CPUs doesn’t come into play. The P4 processors offer superlative performance when moving data in and out of the L2 cache. All the P4 raw block write results are fairly similar, while the reads do vary a bit. Even so, there’s not a lot of differentiation in P4 cache read and write performance for this particular test.

Next up are the 4MB raw block read and write tests. This block size overflows even the L3 cache on the P4 Extreme Edition CPUs. Curiously, the integrated memory controller on the Athlon 64s doesn’t seem to help much here. Since we’re essentially streaming data in one direction from contiguous memory locations, the lower latency offered by the on-die memory controller may not have a large impact.

Interestingly, P4 write performance is roughly the same across all processors. However, Prescott posts a noticeably higher 4MB raw block read result than the other P4’s. Perhaps the deeper buffers and other enhancements to the way Prescott handles memory assist here.
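The reasoning about which block sizes land in which cache can be sketched in a few lines of Python. The L2 and L3 sizes below are the ones given in this review; the L1 data-cache sizes are our assumption, since the review doesn't state them. The sketch also treats a block as "fitting" if it is no larger than the cache, which ignores everything else competing for cache space.

```python
# Cache sizes in KB. L2/L3 figures come from the article; the L1 data-cache
# sizes are an assumption on our part and are not stated in the article.
CACHES_KB = {
    "Prescott": {"L1": 16, "L2": 1024},
    "Northwood": {"L1": 8, "L2": 512},
    "P4 Extreme Edition": {"L1": 8, "L2": 512, "L3": 2048},
}


def smallest_fit(cpu: str, block_kb: int) -> str:
    """Return the smallest cache level that can hold the whole block, else 'RAM'."""
    levels = sorted(CACHES_KB[cpu].items(), key=lambda item: item[1])
    for level, size_kb in levels:
        if block_kb <= size_kb:
            return level
    return "RAM"


for cpu in CACHES_KB:
    print(f"{cpu}: 192KB -> {smallest_fit(cpu, 192)}, 4MB -> {smallest_fit(cpu, 4096)}")
```

Under these assumptions, the 192KB block lands in L2 on every P4 variant, while the 4MB block spills to main memory even on the Extreme Edition.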

When we move to random access tests, the Athlon 64’s integrated controller once again seems to have relatively little impact. Perhaps the 192KB block size is still too large for the controller’s lower latency to matter. The 4MB random access tests are relatively even across the Intel processor family, but the Prescott fares quite poorly in the 192KB test.

Note that these block sizes are fairly large. What happens when you try to move small chunks of data in and out of RAM?

The Athlon 64s do handle smaller blocks better when it comes to random access patterns and 4KB block writes to memory. However, the Pentium 4 has a seemingly staggering appetite for memory reads, which is no surprise to anyone who has seen P4 performance increase with memory bandwidth increases.

How much will all this set you back? Let’s look at the latest CPU pricing (quantity 1000):

CPU                                 Price
Pentium 4 2.8GHz                    $178
Athlon 64 2800+                     $193
Pentium 4 3.0/3.06GHz               $218
Athlon 64 3000+                     $233
Pentium 4 3.2GHz                    $278
Athlon 64 3200+                     $293
Pentium 4 3.4GHz                    $417
Athlon 64 3400+                     $417
Athlon 64 FX-51                     $725
Pentium 4 Extreme Edition 3.2GHz    $925
Pentium 4 Extreme Edition 3.4GHz    $999

If you look at the pricing table, you’ll see no differentiation between Northwood and Prescott. In fact, Intel is offering identical pricing for Prescott and Northwood processors running at the same clock speed. The Extreme Edition CPUs are a different story, but that’s no surprise. Prescott’s pricing is a pleasant surprise, and Intel even seems to be trying to undercut AMD here.
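For a quick look at the undercutting, the following Python sketch subtracts Intel's quantity-1000 price from AMD's at each tier. The prices are copied from the table above. Pairing each Pentium 4 clock speed with the Athlon 64 of the matching model number is our own simplification, and the names in the code are illustrative.

```python
# Quantity-1000 prices (USD) from the pricing table, paired by tier.
# The pairing of clock speed with AMD model number is our simplification.
TIERS = [
    ("2.8GHz vs. 2800+", 178, 193),
    ("3.0GHz vs. 3000+", 218, 233),
    ("3.2GHz vs. 3200+", 278, 293),
    ("3.4GHz vs. 3400+", 417, 417),
]


def amd_premium(intel_price: int, amd_price: int) -> int:
    """Return how many dollars more the AMD part costs than the Intel part."""
    return amd_price - intel_price


for tier, intel_price, amd_price in TIERS:
    print(f"{tier}: Athlon 64 costs ${amd_premium(intel_price, amd_price)} more")
```

The Pentium 4 comes in $15 under the Athlon 64 at each of the three lower tiers, and the two are priced identically at the top.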

Just because Intel’s quantity 1000 pricing is the same doesn’t mean you’ll see identical prices from resellers. But the prices for the different processors should be close. How do you tell a Prescott apart from a Northwood, then? Intel is appending an “E” to the end of the frequency for 90nm Pentium 4s with 1MB of L2 cache. So a 3.0GHz Prescott will be labeled “3.0E”. The most confusing model will be the 2.8GHz P4, which ships in a variety of flavors: 2.8 (Northwood, 533MHz FSB), 2.8C (Northwood, 800MHz FSB) and 2.8E (Prescott, 800MHz FSB).
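The labeling rule is simple enough to express as a small lookup. The Python sketch below encodes only what is described above: the three 2.8GHz variants and the “E” suffix rule. It makes no claim about what other suffixes mean, and the function and table names are ours.

```python
import re

# The three 2.8GHz variants named in the article: label -> (core, FSB in MHz).
KNOWN_2_8GHZ = {
    "2.8": ("Northwood", 533),
    "2.8C": ("Northwood", 800),
    "2.8E": ("Prescott", 800),
}


def is_prescott(label: str) -> bool:
    """True if the label is a frequency followed by a single 'E' (e.g. '3.0E')."""
    return re.fullmatch(r"\d\.\d{1,2}E", label.strip().upper()) is not None


for label in ("2.8", "2.8C", "2.8E", "3.0E"):
    print(label, "->", "Prescott" if is_prescott(label) else "not marked as Prescott")
```

The pattern deliberately requires exactly one trailing “E” after the frequency, so a label that merely ends in “E” for some other reason isn't misread as a Prescott.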

Of course, the 3.4GHz version of Prescott isn’t currently available, but George Alfs of Intel told us that systems based on the 3.4GHz Prescott would be available by the end of Q1 (end of March). Retail, boxed versions should also be shipping by then.

Should you buy today? The new pricing for Intel CPUs makes them more affordable than in the past (Extreme Edition excepted).

Of course, if you have an older motherboard, Prescott may not be an option, even if the board uses a current generation 865 or 875P chipset. Some motherboards had to undergo a respin to improve power regulation to support Prescott. So if you want to move to the latest Intel technology, you might need a new motherboard. Note that the Asus P4C800-E Deluxe we used only required a BIOS update.

A new processor is always difficult to test. When we originally previewed the Athlon 64 FX-51, we were somewhat frustrated by our inability to adequately test its most salient feature: performance on 64-bit applications in a 64-bit operating system.

With Prescott, we were also somewhat hobbled by the lack of SSE3-enhanced applications. Not all applications will benefit from SSE3 optimizations. For example, it’s unlikely that office applications will benefit much. And until the DirectX libraries and various graphics drivers are optimized for Prescott, gaming performance may not improve dramatically.

Surprisingly, Prescott disappointed us a bit in the content creation arena. While still holding its own in video processing chores, it lags a bit in audio. Most of all, its performance in 3D content creation seemed subpar.

However, the real strength of Prescott seems to lie in its Hyper-Threading performance. In the majority of our multitasking tests, the Prescott performed as well as the higher clocked 3.4GHz Northwood CPU. In at least one case, Intel’s latest offering even outpaces the 3.4GHz P4 Extreme Edition, so it’s not just a matter of cache size. So if you’re running a system with a lot of windows open, and lots of background processes running — a situation all too common these days — Prescott may be just your cup of tea.

But for the dedicated gamer, the Athlon 64 3400+ is tough to beat, provided you’re not running many background tasks. We were pretty impressed by the performance of the 3400+ when running games as the only major task. On the other hand, the reliance of the Athlon 64 FX-51 on registered memory seems to hobble its performance by comparison. Although you may pick up a bit of performance with a different motherboard, the 3400+ is still a relative bargain. AMD really needs to get out the socket 939 versions, which will support unbuffered SDRAM, if they want to continue to command such a high price for the FX line.

The Pentium 4 Extreme Editions are pricey, and may really be luxury items for most users. However, the P4EE proves to be a productive processor in content creation and video editing applications. Users looking for a good, entry level workstation processor for those types of applications may actually find the high end of the Intel desktop line to be a good fit for their needs. The P4EE also does pretty well in games, but the price/performance ratio is fairly unimpressive. Note that systems from major suppliers that incorporate the Extreme Edition may be more cost effective than building your own P4EE-based system.

Recently, more rumors have been flying around the Internet regarding 64-bit support in Prescott. A recent posting on the popular site Slashdot pointed to several articles speculating on Intel’s shift towards something that may resemble AMD’s x86-64. The next several weeks should yield more concrete information on this.

While Prescott did not provide a notable performance improvement over Northwood at the same clock speed, its longer pipeline should let it scale to much higher frequencies than Northwood could, with 4GHz parts likely by year end. That headroom and its improved multitasking performance are Prescott’s aces in the hole. And of course, we’ll likely see Xeon spins of Prescott’s core architecture later in the year, with larger caches and multiprocessing capability. And who knows, maybe all the rumors circulating over the past year about Prescott having latent 64-bit x86 “Yamhill” features built-in are true.

Product: Intel Pentium 4 with 1MB L2 Cache
Web site: www.intel.com
Pros: Improved multithreaded performance; good price
Cons: Slightly slower in some applications than the old Pentium 4; no 3.4GHz CPU at launch
Summary: Offering excellent multitasking performance, the new Prescott CPU does lag behind other processors for dedicated gaming and several other applications. The price is right, but you may need a new motherboard.
Price: $278

Score:

Product: Pentium 4 3.4GHz with 512KB L2 cache
Web site: www.intel.com
Pros: It’s a 3.4GHz part that won’t set you back an arm and a leg.
Cons: It’s older technology; the slower 3.2GHz Prescott outdoes it in multitasking performance
Summary: This is the last gasp for Northwood, but it goes out with a bang, beating its younger sibling in the clock rate race.
Price: $417

Score:

Product: Pentium 4 Extreme Edition at 3.4GHz
Web site: www.intel.com
Pros: Intel’s fastest desktop CPU; offers a staggering 2MB of L3 cache.
Cons: Very expensive
Summary: Unless you have a specific productivity need or want to own a status symbol, it’s hard to justify the price/performance ratio.
Price: $999

Score:

Copyright © 2004 Ziff Davis Media Inc. All Rights Reserved. Originally appearing in ExtremeTech.