Originally posted by: Peter
The reason is rather the reverse. The Athlon has a much shorter processing pipeline and much better caching and prefetching (partly due to its much larger L1 cache). Thus its rather slow front side bus has little negative effect.
Pipeline length has little if anything to do with memory bandwidth needs. You did, however, hit the nail on the head about caching. As for "better", that depends on what you deem "better". The Athlon's cache fetches 64 bytes per cacheline, and its relatively large, exclusive L1 cache means it can access memory a bit less often, with each access transferring less.
On the flip side, in a program with high data locality, transferring a bigger cacheline means you get more of the information you need at once. The P4 has a 128-byte cacheline (fetched in 64-byte sectors). In programs with high data locality, its cache can fetch more useful information per access and feed it to the processor. This is why it thrives on massive bandwidth: more room to prefetch more and more data into cache.
Both approaches have advantages and disadvantages. Smaller cacheline size means more granularity and less "waste" of memory bandwidth in programs with low data locality. Larger cacheline size means you are able to effectively use memory bandwidth to increase cache hit rates and mask memory latency.
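To make the tradeoff concrete, here's a toy model of the two cacheline sizes. All numbers are illustrative only, not measurements of real Athlon or P4 hardware:

```python
# Toy model of the cacheline-size tradeoff described above.
# Numbers are made up for illustration, not real hardware figures.

LINE_SMALL = 64    # Athlon-style cacheline, bytes
LINE_LARGE = 128   # P4-style cacheline, bytes

def useful_fraction(bytes_needed_per_miss, line_size):
    """Fraction of each fetched line the program actually touches."""
    return min(bytes_needed_per_miss, line_size) / line_size

# Low data locality: the program only wants 8 bytes per miss.
# The bigger line "wastes" twice the bandwidth on unused data.
print(useful_fraction(8, LINE_SMALL))   # 0.125  -> 12.5% of the fetch is useful
print(useful_fraction(8, LINE_LARGE))   # 0.0625 -> only 6.25% is useful

# High data locality: the program streams through 128 contiguous bytes.
# The large line satisfies that in one fetch; the small line needs two.
print(128 // LINE_LARGE)   # 1 memory access
print(128 // LINE_SMALL)   # 2 memory accesses
```

The same bandwidth budget either gets wasted (low locality, big lines) or buys you fewer memory accesses (high locality, big lines), which is exactly the tradeoff described above.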
The P4, on the other hand, is very allergic to pipeline stalls, thus requiring the fastest and bestest path to RAM it can get to show some good performance. Since Intel doesn't currently put the RAM controller directly into the CPU like AMD is now doing, they need this very high FSB speed and the dual channel RAM arrangement.
Yes and no. Again, a common misconception is that a long-but-narrow design somehow suffers more from waiting than a short-but-wide design. Not really true. Both need to be fed data and both suffer when that data isn't fed. Both waste the same amount of transistor idle time; the only difference is that the statistical IPC of the short-but-wide design will be higher.
To illustrate, let's use a simplistic example. You have processor A, running at 2 GHz and able to issue at peak 3 instructions per cycle. You have processor B, running at 1 GHz and able to issue at peak 6 instructions per cycle. You have a memory subsystem that runs at 100 MHz, and (for simplicity's sake) it takes 5 memory cycles to get data from memory. On the 2 GHz machine, one memory cycle spans 2 GHz / 100 MHz = 20 CPU cycles, so a memory fetch means waiting 20 * 5 = 100 cycles for the data. In that time, your transistors sit idle. In that same time, they could have executed 100 * 3 = 300 instructions.
Now let's look at processor B. For it, one memory cycle spans only 10 CPU cycles, so the same fetch costs 10 * 5 = 50 cycles, during which it could have executed 50 * 6 = 300 instructions. Either way, the transistors (all those parallel execution, decoding, scheduling units, etc.) that could have been doing work sit idle, wasting power.
Both processors waste the same amount of potential performance per second. The only difference is that, since processor B waits fewer cycles, its statistical IPC is higher than processor A's (exactly double, in this idealized example).
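The two-processor example above can be checked as straight arithmetic:

```python
# The stall-cost example above, as arithmetic.
# Processor A: 2 GHz, peak 3 instructions/cycle.
# Processor B: 1 GHz, peak 6 instructions/cycle.
# Memory: 100 MHz, 5 memory cycles per fetch.

MEM_HZ = 100e6
MEM_CYCLES_PER_FETCH = 5

def stall_cost(cpu_hz, peak_ipc):
    """CPU cycles spent waiting on one fetch, and instructions lost in that time."""
    cpu_cycles_per_mem_cycle = cpu_hz / MEM_HZ
    stall_cycles = cpu_cycles_per_mem_cycle * MEM_CYCLES_PER_FETCH
    lost_instructions = stall_cycles * peak_ipc
    return stall_cycles, lost_instructions

print(stall_cost(2e9, 3))   # A: (100.0, 300.0) -> 100 idle cycles, 300 lost instructions
print(stall_cost(1e9, 6))   # B: (50.0, 300.0)  -> 50 idle cycles, same 300 lost
```

Both designs forfeit 300 instructions per fetch; B just spreads the loss over half as many (slower) cycles, which is why its measured IPC comes out higher.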
So no: long pipeline, short pipeline, whatever, it doesn't matter for achieving full performance. All that matters is how much potential throughput your processor *could* have (be it an IPC of 6 at 1 GHz or an IPC of 3 at 2 GHz) versus how much it actually achieves once memory bottlenecks take their toll.
Different caching methods aimed at different types of memory subsystems deal with this in different ways. The P4's caching system targets high-bandwidth, high-latency memory subsystems. It spends large amounts of bandwidth fetching extra data that could potentially be useful, and so achieves a better cache hit rate with less cache. However, this is also very demanding on the memory subsystem.
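The payoff of buying hit rate with bandwidth can be sketched with the standard average-memory-access-time formula. The cycle counts and hit rates below are invented for illustration:

```python
# Back-of-the-envelope: how cache hit rate masks memory latency.
# AMAT (average memory access time) = hit_time + miss_rate * miss_penalty.
# All cycle counts and rates here are invented, not real P4/K7 numbers.

def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per memory access, given a cache hit time and miss penalty."""
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# Suppose aggressive prefetching into bigger lines cuts the miss rate
# from 5% to 2%; against a 100-cycle miss penalty, that matters a lot.
print(amat(2, 0.05, 100))   # 7.0 cycles per access on average
print(amat(2, 0.02, 100))   # 4.0 cycles per access on average
```

A few percentage points of hit rate, bought with extra bandwidth, translate directly into fewer average cycles per access, which is the whole point of the P4's approach.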
The other approach is something like the K7's: use memory bandwidth more frugally, at the cost of some peak performance. And if your memory subsystem already supplies more than your prefetching scheme can use, extra bandwidth won't help.
Those who think the K7 simply isn't memory-limited should look at the wonders the integrated memory controller has done for the K8 (which uses a very similar core, almost unchanged save for the 64-bit extension and more efficient integer scheduling method).