Technically speaking...why don't AMD cpu's use bandwidth like P4s?

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0
I got guys askin' me this and I don't think it has much to do with P4's 20 stage pipe....what gives?
 

Moohooya

Senior member
Oct 10, 1999
677
0
0
I believe it is due to a larger on board cache. Cache misses are less frequent, and hence fewer memory requests are made.
 

fkloster

Diamond Member
Dec 16, 1999
4,171
0
0
Originally posted by: Moohooya
I believe it is due to a larger on board cache. Cache misses are less frequent, and hence fewer memory requests are made.


Ummm, but the P4's cache is larger than the Athlon's (512KB vs. 256KB). By your reasoning, wouldn't the Athlon need more FSB bandwidth than the P4?
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Clock speed and cacheline size would be my guess. The processor has to request instructions every clock, and if the instructions are not in cache it has to read from memory. The prefetch algorithm has to work very aggressively when the processor's clock rate is very high, because 1. many requests are made and 2. a cache miss wastes a lot more clock cycles.
Each time a request to memory is made, 64 bytes is transmitted from memory to cache (a cacheline). Even if the processor is only asking for 1 instruction, whereas an Athlon would ask for 3 instructions per clock, the P4 still gets a 64-byte cacheline transmitted to cache for every request (per clock if it's not in cache), and the Athlon still gets a 32-byte cacheline transmitted to cache for every request (per clock). So with the combination of the P4's higher clock speed and its larger cacheline size, I would say it sucks up a significantly greater amount of bandwidth than an Athlon would. However, it can also benefit more from higher memory bandwidth: with a greater cacheline size you get a greater chance of a cache hit, and coupled with more aggressive prefetch logic you can almost eliminate the problem of memory latency.
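To make the cacheline point concrete, here's a toy sketch. The 64/32-byte line sizes are the ones quoted in this post (a later post puts the actual L2 line sizes at 128/64 bytes), so treat the numbers as the thread's assumptions, not datasheet values:

```python
# Toy illustration: a cache miss always transfers whole lines from memory,
# no matter how few bytes the processor actually asked for.

def miss_traffic(requested_bytes, line_size):
    """Bytes moved over the bus to satisfy a miss of `requested_bytes`."""
    lines = -(-requested_bytes // line_size)  # ceiling division
    return lines * line_size

print(miss_traffic(4, 64))  # one 4-byte load, 64-byte line -> 64 bytes moved
print(miss_traffic(4, 32))  # same load, 32-byte line -> 32 bytes moved
```

So a chip that misses at the same rate but has twice the line size pulls twice the bandwidth for the same 4-byte load.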
 

PrinceXizor

Platinum Member
Oct 4, 2002
2,188
99
91
Actually, the base FSB clock on AMD's chips is currently higher than Intel's (166MHz vs. 133MHz), but Intel has quantispeed architecture, so 100*4=400 and 133*4=533, while AMD has a double pump to 333MHz. That's why you see the new 667MHz and 800MHz specs from Intel: these are incremental jumps in the base clock with the quantispeed making up the difference, so 166*4=667MHz and 200*4=800MHz.
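The multiplier arithmetic in that post can be sketched like this. Note the base clocks are really 133.33/166.67 MHz, which is why 166*4 gets marketed as "667" rather than 664 (pump factors per the thread: Intel's P4 bus is quad-pumped, the Athlon's EV6 bus is double-pumped):

```python
# Effective FSB transfer rate = base clock * transfers per clock edge pair.

def effective_fsb(base_mhz, pumps_per_clock):
    return base_mhz * pumps_per_clock

print(round(effective_fsb(133.33, 4)))  # ~533 (P4 "533MHz" bus)
print(round(effective_fsb(166.67, 2)))  # ~333 (Athlon XP "333MHz" bus)
print(round(effective_fsb(166.67, 4)))  # ~667 (later P4 spec)
```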

 

Dulanic

Diamond Member
Oct 27, 2000
9,950
569
136
Probably because they're using an aging CPU design? The Athlon core design is old, and it was designed around the use of SDRAM. They've made many improvements to it, but redesigning the core to take advantage of faster memory, plus a new FSB type to make using that faster memory even worthwhile... well, it's not worth it. Why waste the time when you know you already have all this stuff coming with a new CPU next year?
 

RaynorWolfcastle

Diamond Member
Feb 8, 2001
8,968
16
81
Originally posted by: PrinceXizor
Actually, the base FSB clock on AMD's chips is currently higher than Intel's (166MHz vs. 133MHz), but Intel has quantispeed architecture, so 100*4=400 and 133*4=533, while AMD has a double pump to 333MHz. That's why you see the new 667MHz and 800MHz specs from Intel: these are incremental jumps in the base clock with the quantispeed making up the difference, so 166*4=667MHz and 200*4=800MHz.

Actually, Quantispeed is an AMD term; I think you're referring to the NetBurst architecture. Either way, regardless of the names, effective transfer rate is what matters, and as of now Intel is way ahead of AMD in that dept.
 

Adul

Elite Member
Oct 9, 1999
32,999
44
91
danny.tangtam.com
Well, Intel's trace cache has a role in it, I think. The P4 was designed to take advantage of bandwidth from the FSB. Well, it looks that way. Where is pm?
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: imgod2u
Clock speed and cacheline size would be my guess. The processor has to request instructions every clock, and if the instructions are not in cache it has to read from memory. The prefetch algorithm has to work very aggressively when the processor's clock rate is very high, because 1. many requests are made and 2. a cache miss wastes a lot more clock cycles.
Each time a request to memory is made, 64 bytes is transmitted from memory to cache (a cacheline). Even if the processor is only asking for 1 instruction, whereas an Athlon would ask for 3 instructions per clock, the P4 still gets a 64-byte cacheline transmitted to cache for every request (per clock if it's not in cache), and the Athlon still gets a 32-byte cacheline transmitted to cache for every request (per clock). So with the combination of the P4's higher clock speed and its larger cacheline size, I would say it sucks up a significantly greater amount of bandwidth than an Athlon would. However, it can also benefit more from higher memory bandwidth: with a greater cacheline size you get a greater chance of a cache hit, and coupled with more aggressive prefetch logic you can almost eliminate the problem of memory latency.

Exactly the correct answer, though the P4's L2 has a 128-byte line size vs. the Athlon's 64-byte L2 line size.

The general rule of thumb is that if an L2 cache is at least 4-8 times the size of the L1, the size of the L1 will not affect the L2's global miss-rate. While the former condition isn't satisfied by the Athlon, its cache exclusivity leads to the conclusion that an Athlon with a 256KB L2 should have roughly the same global L2 miss-rate as a P4 with a 256KB L2. Given that the Athlon w/256KB L2 should have a global L2 miss-rate of around 1%, Northwood w/512KB L2 should have a global miss-rate of around 1% / 2^0.5 ≈ 0.7% (since, as a rule of thumb, miss-rate generally halves when cache size is quadrupled).
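That rule of thumb as arithmetic (a rough sketch; the 1% base rate is the post's estimate, not a measurement):

```python
import math

# Rule of thumb: miss rate roughly halves when cache size quadruples,
# i.e. it scales with 1/sqrt(size ratio).

def scaled_miss_rate(base_rate, size_ratio):
    return base_rate / math.sqrt(size_ratio)

print(round(scaled_miss_rate(0.01, 2), 4))  # 256KB -> 512KB: ~0.0071 (0.7%)
print(round(scaled_miss_rate(0.01, 4), 4))  # 256KB -> 1MB:   0.005  (0.5%)
```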

Assuming (given an average workload) that the Athlon retires roughly 1.2 x86 instructions/cycle vs. the P4's 1 x86 instruction/cycle, and that the average x86 instruction requires 1.4 memory accesses (one for the instruction itself, and 0.4 from roughly 40% of instructions being loads or stores):

For the P4: (1 instruction/cycle) * (1.4 mem accesses / instruction) * (0.007 L2 missrate) * (128 bytes needed / miss) = 1.25 bytes required from the main memory / cycle => 3.5 GB/sec memory bandwidth needed at 3 GHz

For the Athlon: (1.2 instructions/cycle) * (1.4 mem accesses / instruction) * (0.01 L2 missrate) * (64 bytes needed / miss) = 1.075 bytes required from the main memory / cycle => 2.5 GB/sec memory bandwidth needed at 2.25 GHz

This is just a rough estimate (the numbers are likely a little high), but you get the idea.
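The estimate above packs neatly into one function. The inputs (IPC, miss rates, line sizes) are the post's rough assumptions, so the outputs are ballpark figures, not measurements:

```python
# bytes/cycle = IPC * (memory refs per instruction) * (global L2 miss rate)
#               * (L2 line size); multiplying by clock in GHz gives GB/s,
#               since 1 GHz = 1e9 cycles/s.

def mem_bandwidth_gb_s(ipc, refs_per_instr, l2_miss_rate, line_bytes, clock_ghz):
    bytes_per_cycle = ipc * refs_per_instr * l2_miss_rate * line_bytes
    return bytes_per_cycle * clock_ghz

p4     = mem_bandwidth_gb_s(1.0, 1.4, 0.007, 128, 3.0)   # ~3.8 GB/s
athlon = mem_bandwidth_gb_s(1.2, 1.4, 0.010,  64, 2.25)  # ~2.4 GB/s
print(round(p4, 2), round(athlon, 2))
```

Running the exact arithmetic gives about 3.76 GB/s and 2.42 GB/s, close to the post's rounded 3.5 and 2.5 figures.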
 

shr

Member
Nov 10, 2002
47
0
0
A few other reasons. First, the Athlon does not duplicate L1 cache data into L2, which is why it can technically be said that the Athlon has 384KB of cache.

There is also the fact that x86 is a CISC instruction set, and both the Athlon and P4 are RISC processors that translate CISC instructions into RISC before executing. On average the Athlon gets 3 RISC instructions per CISC instruction translated, whereas the P4 gets 4.


Also, a cache miss has a significantly bigger performance hit on a P4 due to its 20-stage pipeline.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: shr
A few other reasons. First, the Athlon does not duplicate L1 cache data into L2, which is why it can technically be said that the Athlon has 384KB of cache.
Stating that the Athlon has "384 KB of cache" may be suitable for public relations, but summing up the two-level cache hierarchy into such a neat little figure is flawed when analyzing cache performance. The fact is that as long as the L2 is at least four to eight times the size of the L1, the size of the L1 has little effect on the global L2 miss-rate...this was well established by Przybylski, Horowitz, and Hennessy in "Characteristics of Performance-Optimal Multi-Level Cache Hierarchies" (ISCA 1989). Exclusivity is merely necessary in the Athlon to prevent its small L2 (relative to the L1) from becoming useless; the P4's 8 KB L1D cache and its trace cache, estimated to be equivalent to 32 KB, are small with respect to its L2. The two respective designs will then have nearly equal L2 miss-rates given the same L2 size.

There is also the fact that x86 is a CISC instruction set, and both the Athlon and P4 are RISC processors that translate CISC instructions into RISC before executing. On average the Athlon gets 3 RISC instructions per CISC instruction translated, whereas the P4 gets 4.
That's quite far off, Bhandarkar and Ding found that in SPEC CPU95 the average is 1.35 uops/x86 instruction on the Pentium Pro. I've never seen any studies that indicate the Athlon or P4 are dissimilar, especially since the Athlon's 6-entry uop decode queue would quickly become saturated if the rate was any higher. Regardless, this has no effect on main memory bandwidth requirements.


Also, a cache miss has a significantly bigger performance hit on a P4 due to its 20-stage pipeline.
The cache miss penalty is higher for the P4 due to its higher clock rate; while its clock rate is higher due to its longer pipeline, pipeline length alone has little to no effect on cache miss penalty. And while an L2 cache miss may be detrimental to performance, part of the higher penalty is amortized by the P4's larger reorder window (126 instructions vs. 54). Again, this has little effect on main memory bandwidth requirements, which is what is being discussed.
 

shr

Member
Nov 10, 2002
47
0
0
Originally posted by: Sohcan
Originally posted by: shr
A few other reasons. First, the Athlon does not duplicate L1 cache data into L2, which is why it can technically be said that the Athlon has 384KB of cache.
Stating that the Athlon has "384 KB of cache" may be suitable for public relations, but summing up the two-level cache hierarchy into such a neat little figure is flawed when analyzing cache performance. The fact is that as long as the L2 is at least four to eight times the size of the L1, the size of the L1 has little effect on the global L2 miss-rate...this was well established by Przybylski, Horowitz, and Hennessy in "Characteristics of Performance-Optimal Multi-Level Cache Hierarchies" (ISCA 1989). Exclusivity is merely necessary in the Athlon to prevent its small L2 (relative to the L1) from becoming useless; the P4's 8 KB L1D cache and its trace cache, estimated to be equivalent to 32 KB, are small with respect to its L2. The two respective designs will then have nearly equal L2 miss-rates given the same L2 size.

There is also the fact that x86 is a CISC instruction set, and both the Athlon and P4 are RISC processors that translate CISC instructions into RISC before executing. On average the Athlon gets 3 RISC instructions per CISC instruction translated, whereas the P4 gets 4.
That's quite far off, Bhandarkar and Ding found that in SPEC CPU95 the average is 1.35 uops/x86 instruction on the Pentium Pro. I've never seen any studies that indicate the Athlon or P4 are dissimilar, especially since the Athlon's 6-entry uop decode queue would quickly become saturated if the rate was any higher. Regardless, this has no effect on main memory bandwidth requirements.


Also, a cache miss has a significantly bigger performance hit on a P4 due to its 20-stage pipeline.
The cache miss penalty is higher for the P4 due to its higher clock rate; while its clock rate is higher due to its longer pipeline, pipeline length alone has little to no effect on cache miss penalty. And while an L2 cache miss may be detrimental to performance, part of the higher penalty is amortized by the P4's larger reorder window (126 instructions vs. 54). Again, this has little effect on main memory bandwidth requirements, which is what is being discussed.

In the event of a branch misprediction the pipeline has to be flushed, right? That would waste 20 clock cycles (unless it's in one of the P4's fast execution units, then 10). Because of the serial nature of x86, the following instructions would be dependent on the outcome of the branch (?)

About the CISC/RISC thing, hmm, I tried to find the link and the only thing I could find was on Hammer, and it said something similar, 3 instructions to 9 ROPs under ideal circumstances, so you're perhaps right, my bad. In case you care, the link is Here

I guess I just got carried away by the architecture talk and kinda diverged off the bandwidth subject (oops).
 

KF

Golden Member
Dec 3, 1999
1,371
0
0
>That's quite far off, Bhandarkar and Ding found that in SPEC CPU95 the average is 1.35 uops/x86 instruction on the Pentium Pro.
Each native x86 instruction is translated into a specific group of internal RISC ops. What this benchmark (SPEC CPU95) has to do with it, I don't know.

> I've never seen any studies that indicate the Athlon or P4 are dissimilar, especially
> since the Athlon's 6-entry uop decode queue would quickly become saturated if the rate was any higher.
What has "rate" got to do with it? Again, a particular x86 instruction is translated into a specific group of instructions that are executable internally. No matter how slow or fast the ops are executed, the number remains the same. Yes, the ops are different for an Athlon and a P4. Why would they be the same? And the P4 is a completely new design from the PPro/PII/PIII.

Although they evidently do design caches on the basis of hit-rate statistics, in point of fact optimized programs readily beat the statistics when speed counts. And programmers do optimize programs when speed is important. For that reason, cache size is the most important characteristic: cache that runs fast. Therefore, if you want a CPU that runs as fast as possible when it counts, you have to consider other things than the average statistical hit rate.

>The fact is that as long as the L2 is at least four to eight times the size of the L1,
>the size of the L1 has little effect on the global L2 miss-rate.
What happens if the L2 is less than 4 to 8 times the L1, something which is desirable if you want as much fast memory to run programs from as possible? What happens when the miss-rate is designed (by the programmer) to be 0 at the times when speed is needed most?

>..this was well established by Przybylski, Horowitz, and Hennessy in "Characteristics of
> Performance-Optimal Multi-Level Cache Hierarchies" (ISCA 1989).
> Exclusivity is merely necessary in the Athlon to prevent its small L2
> (relative to the L1) from becoming useless;

I have no doubt the designers of the Athlon know infinitely more about designing a real CPU than whoever these authors may be. Real designers also do extensive statistics on real programs to check.


In general I don't see in this thread a specific explanation of why the P4 requires higher bandwidth to perform on the same level as a slower clocked Athlon.

It is a consequence of the long pipeline. The longer pipeline causes a bigger penalty for branch mispredictions. I think it also leads to more frequent pipeline stalls due to dependent instructions. There is also the case where branching to a new location requires things that are not located in cache. You can make up for the pipeline's penalty by running the CPU faster. But running the CPU faster requires the memory to be faster as well, to keep the cache supplied at a proportionally higher rate.

 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
I would think that the way the P4 caches microinstructions rather than x86 code would make branches in general more expensive, not just in terms of time but in terms of cache space, because the address of the cached microinstruction the branch points to must be cached, not just the logical address branched to. Perhaps the reevaluation of branch targets, which may be cached when new code blocks are entered, uses some of the bandwidth. Or maybe it's *only* caching the cache location of the microinstruction branched to and not the logical address, which means that if the branch target is flushed from the cache, the branch instruction itself is also invalidated and must be recached. Any thoughts on this, pm?

Also, having hyperthreading turned on is going to raise the bandwidth dependency for what should be obvious reasons...

Sohcan: are you sure P4 cache lines are 128 bytes? I thought all x86 processors since the Pentium had 64 byte cache lines, exactly matching memory burst length. This alone would in fact dramatically increase the bandwidth dependency if true.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: KF
>That's quite far off, Bhandarkar and Ding found that in SPEC CPU95 the average is 1.35 uops/x86 instruction on the Pentium Pro.
Each native x86 instruction is translated into a specific group of internal RISC ops. What this benchmark (SPEC CPU95) has to do with it, I don't know.

SPEC CPU95 represented an average workstation workload for the purposes of the study.

> I've never seen any studies that indicate the Athlon or P4 are dissimilar, especially
> since the Athlon's 6-entry uop decode queue would quickly become saturated if the rate was any higher.
What has "rate" got to do with it?
If the Athlon decoded x86 instructions into 3 uops on average, the microprocessor would often stall on fetch because its decode queue, with 6 entries that are issued at a rate of 3 uops/cycle, would quickly become a bottleneck.

Again, a particular x86 instruction is translated into a specific group of instructions that are executable internally. No matter how slow or fast the ops are executed, the number remains the same.
I never said execution rate had an impact on uop translation.

Yes, the ops are different for an Athlon and a P4. Why would they be the same? And the P4 is a completely new design from the PPro/PII/PIII.
There may be slight differences, but the vast majority of x86 instructions used are decoded into 1-2 uops, 2 if the instruction is register-memory.

>The fact is that as long as the L2 is at least four to eight times the size of the L1,
>the size of the L1 has little effect on the global L2 miss-rate.
What happens if the L2 is less than 4 to 8 times the L1, something which is desirable if you want as much fast memory to run programs from as possible? What happens when the miss-rate is designed (by the programmer) to be 0 at the times when speed is needed most?

>..this was well established by Przybylski, Horowitz, and Hennessy in "Characteristics of
> Performance-Optimal Multi-Level Cache Hierarchies" (ISCA 1989).
> Exclusivity is merely necessary in the Athlon to prevent its small L2
> (relative to the L1) from becoming useless;

I have no doubt the designers of the Athlon know infinitely more about designing a real CPU than whoever these authors may be. Real designers also do extensive statistics on real programs to check.
Do you care to dispute their findings? The fact that you don't know the authors doesn't lend much credibility. The relationship between global and local multilevel hit-rates is well understood, obviously also by the K7 design team, given that they decided to make the L2 cache exclusive...FYI, John Hennessy, from Stanford, co-championed RISC in the early 80s, and his MIPS project became the MIPS R2000, the first commercial RISC microprocessor. He and Dave Patterson from Berkeley are considered the "fathers" of modern computer architecture.

In general I don't see in this thread a specific explanation of why the P4 requires higher bandwidth to perform on the same level as a slower clocked Athlon.

It is a consequence of the long pipeline. The longer pipeline cause a bigger penalty for branch mispredictions. I think it also leads to more frequent pipeline stalls due to dependant instructions. There is also the case where branching to a new location requires things that are not located in cache. You can make up for the pipeline's penalty by running the CPU faster. But running the CPU faster requires the memory to be faster also to keep the cache supplied at the proportionally higher rate.

No, pipeline length alone has nothing to do with main memory bandwidth. I've already explained the cause: memory bandwidth is needed to service blocks for L2 cache misses. This is dependent on the number of memory references per cycle (ISA dependent), the L2 cache miss-rate (a function of block size, associativity, size of the L2 victim cache, and cache size, independent of L1 size as long as the usable L2 is 4-8 times the size of the L1), IPC, clock rate, and L2 block size. This is a well understood property; read chapter 5 of "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson and "Using cache memory to reduce processor-memory traffic" by J.R. Goodman.

A longer pipeline potentially increases the miss-prediction penalty, but it alone has no immediate effect on the frequency with which L2 blocks must be serviced from main memory. It can affect the IPC (through the miss-speculation penalty) and the clock rate, both of which affect main memory bandwidth; but given two microprocessors with the same miss-prediction penalty and the same clock rate, pipeline length will not affect main memory bandwidth. I'm a graduate student researching computer architecture; these are not wild concepts but rather basic conclusions of memory systems.
 

KF

Golden Member
Dec 3, 1999
1,371
0
0
>Do you care to dispute their findings?
I wouldn't dispute findings to the extent that findings are facts. I like facts. Since those facts from those people have no application when the cache ratios are what they happen to actually be in the case of the Athlon, why they should be relevant you have not explained.

> The fact that you don't know the authors doesn't lend much credibility.
No credibility is required, since I presented the full argument. Besides the argument, there is the actual performance of the Athlon which performs in accordance with it.

> The relationship between global and local multilevel hit-rates is well understood,
And statistical conditions have to yield to actual conditions. As I mentioned, running real code and examining the results is a gigantic factor in the design of x86 CPUs. With the appropriate utilities, you can in fact get internal CPU data for lots of things, because it is so important that Intel has designed in the capability. For several generations of CPUs, Intel has depended on getting programmers to alter their practices to suit the CPU in order to derive a big percentage of the performance potential. If the characteristics of actual code had not changed, Intel CPUs would have much lower performance. AMD has to design in accordance, or go out of business; a huge incentive to getting cache sizes realistically optimized.

>FYI, John Hennessy, from Stanford, co-championed RISC in the early 80s, and
> his MIPS project became the MIPS R2000, the first commercial RISC
> microprocessor. He and Dave Patterson from Berkeley are considered the
> "fathers" of modern computer architecture.
The designers of the Athlon are also fathers, fathers of a real microprocessor, one which no doubt stomps the cr*p out of a MIPS R2000.

>If the Athlon decoded x86 instructions into 3 uops on average, the microprocessor
> would often stall on fetch because its decode queue, with 6 entries that are issued
> at a rate of 3 uops/cycle, would quickly become a bottleneck.

6 spots with only 3 things to go into them would seem to be sufficient. Seeing as how it would be remarkable if a programmer could write a program that would actually allow the Athlon to sustain an execution rate of 3 micro-ops per cycle, the decode queue should not be a bottleneck, even if it did stall at over a 3 micro-op rate.

>No, pipeline length alone has nothing to do with main memory bandwidth.
As if I said it did. My explanation is similar to yours, only it resorts to less jargon, less appeal to authority, and less gobbledegook. The way you explain it, it sounds like it has something to do with different cache hit rates, which it doesn't.

>A longer pipeline potentially increases the miss-prediction penalty ...
>but given two microprocessors with the same miss-prediction penalty
>and the same clock rate, pipeline length will not affect main memory bandwidth.
A longer pipeline in itself increases the misprediction penalty. The P4 does not reduce misprediction below the Athlon to undo it, probably because there is no way to do so. The way to make up for it is a higher clock. A higher clock needs a higher memory bandwidth.

> I'm a graduate student researching computer architecture.
Not exactly a surprise. It is always a pleasure to see posts from very knowledgeable people such as yourself.

>these are not wild concepts but rather basic conclusions of memory systems.
I guess when people take issue with what we have said, we are offended. My apologies. If I could phrase things just right, maybe I could avoid offending people, although not entirely I am sure. I am always surprised at how people take things.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: KF
>Do you care to dispute their findings?
I wouldn't dispute findings to the extent that findings are facts. I like facts. Since those facts from those people have no application when the cache ratios are what they happen to actually be in the case of the Athlon, why they should be relevant you have not explained.
Fine, I'll give you a fact. The fact is that the global miss rate for a level 2 cache is equal to the miss rate of the L1 cache times the local miss rate of the L2 cache. The fact is that the local miss rate of the L2 cache varies with the L1 miss rate, and the combined effect is that the global L2 miss-rate, which is what is important when considering multilevel cache miss-rate, is not dependent on the presence and size of the L1 as long as the L2 is 4-8 times the size of the L1. This has been well understood for years; it's not my fault if you have neither the patience, the education, nor the experience to understand this.

> The fact that you don't know the authors doesn't lend much credibility.
No credibility is required, since I presented the full argument. Besides the argument, there is the actual performance of the Athlon which performs in accordance with it.
You presented absolutely no argument about the relationship between L1 and L2 cache size and global L2 hit-rate. The Athlon's hierarchy is absolute proof that the K7 designers took not only the correct, but the only, approach to exclusivity in the L2 given the large L1 cache.

> The relationship between global and local multilevel hit-rates is well understood,
And statistical conditions have to yield to actual conditions. As I mentioned, running real code and examining the results is a gigantic factor in the design of x86 CPUs. With the appropriate utilities, you can in fact get internal CPU data for lots of things, because it is so important that Intel has designed in the capability. For several generations of CPUs, Intel has depended on getting programmers to alter their practices to suit the CPU in order to derive a big percentage of the performance potential. If the characteristics of actual code had not changed, Intel CPUs would have much lower performance. AMD has to design in accordance, or go out of business; a huge incentive to getting cache sizes realistically optimized.
Microprocessor research and design, in both academia and the industry, is analyzed through existing programs and workloads. If you think design and research are performed solely through statistical analysis, you are sorely mistaken. And please don't turn this into some silly Intel vs. AMD fight; fkloster's thread deserves more than that.

>FYI, John Hennessy, from Stanford, co-championed RISC in the early 80s, and
> his MIPS project became the MIPS R2000, the first commercial RISC
> microprocessor. He and Dave Patterson from Berkeley are considered the
> "fathers" of modern computer architecture.
The designers of the Athlon are also fathers, fathers of a real microprocessor, one which no doubt stomps the cr*p out of a MIPS R2000.
Do you take pride in the fact that the Athlon "stomps the cr*p" out of the VAX-11? The MIPS R2000 is nearly two decades old. And I'd still like to hear some explanation of why Patterson's publications are incorrect, especially considering many of the architects of the K7 likely researched under him. You seem to be holding some grudge against research, which is quite humorous since senior architects at microprocessor companies tend to come from academia, and vice versa.

>If the Athlon decoded x86 instructions into 3 uops on average, the microprocessor
> would often stall on fetch because its decode queue, with 6 entries that are issued
> at a rate of 3 uops/cycle, would quickly become a bottleneck.
6 spots with only 3 things to go into them would seem to be sufficient. Seeing as how it would be remarkable if a programmer could write a program that would actually allow the Athlon to sustain an execution rate of 3 micro-ops per cycle, the decode queue should not be a bottleneck, even if it did stall at over a 3 micro-op rate.
If the Athlon decoded, on average, an x86 instruction to 3 uops, then its 3 decoders would decode, on average, 9 uops/cycle. This would easily fill up its decode queue.
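The queue arithmetic behind that claim, as a toy model (the 3-decoder, 6-entry, 3-uops/cycle-issue figures are the ones quoted in this thread):

```python
import math

# If 3 decoders each produced 3 uops/cycle (9 total) while issue drains only
# 3 uops/cycle, the 6-entry queue gains 6 uops net per cycle and stalls at once.

def cycles_to_fill(queue_size, decode_uops_per_cycle, issue_uops_per_cycle):
    surplus = decode_uops_per_cycle - issue_uops_per_cycle
    if surplus <= 0:
        return None  # decode never outruns issue; the queue never fills
    return math.ceil(queue_size / surplus)

print(cycles_to_fill(6, 9, 3))  # 1: full after a single cycle, stalling fetch
```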

>No, pipeline length alone has nothing to do with main memory bandwidth.
As if I said it did. My explanation is similar to yours, only it resorts to less jargon, less appeal to authority, and less gobbledegook. The way you explain it, it sounds like it has something to do with different cache hit rates, which it doesn't.
Funny, I don't consider a correct answer and valid modeling equations to be "gobbledegook." I'll lay things out, which I didn't do originally because imgod2u did a good job:

What affects L2 -> memory bandwidth? The number of L2 misses per second times the miss penalty in bytes.

The miss penalty is primarily dependent on the L2 block size; this is of great importance, since every miss in the L2 must have the block brought in from main memory. Sub-blocking, or dividing the "transfer" block size into smaller blocks, can yield the spatial benefits of larger blocks with the lower bandwidth requirements of smaller blocks, but this hasn't been used in any machines in a while, at least since the IBM 360/85 IIRC. With hardware prefetch, a miss might require more memory accesses; many systems with hardware prefetch use stream buffers, which, upon an L2 miss, request subsequent blocks with a unit stride (for instructions) or possibly some larger stride (for data). This increases the amount of memory requests, but hopefully, if the prefetch is fruitful, compulsory L2 misses will be decreased, lowering the L2 miss-rate.

What affects L2 misses/second? The number of memory references per instruction, times the number of instructions retired per cycle, times the number of L2 misses per global reference, times the clock rate.

The number of memory references per instruction is dependent on the ISA. Each instruction requires an instruction access, plus some number of data operand accesses. On average, an x86 instruction requires 0.4-0.5 data memory references per instruction.

The number of instructions retired / cycle and the clock rate are dependent on a large number of microarchitectural, implementation, and process technology parameters, some of which have been covered and are too numerous to cover in full.

The number of L2 misses / global reference is dependent primarily, or made to be so through exclusivity, on L2 parameters such as cache size, block size, and associativity. Hardware prefetching affects compulsory misses as well, and the presence of victim caches will reduce the number of misses due to associativity conflicts.

This leads back to the rough estimate, with its previously mentioned caveats, demonstrating the average behavior that I posted earlier:

For the P4: (1 instruction/cycle) * (1.4 mem accesses / instruction) * (0.007 L2 missrate) * (128 bytes needed / miss) = 1.25 bytes required from the main memory / cycle => 3.5 GB/sec memory bandwidth needed at 3 GHz

For the Athlon: (1.2 instructions/cycle) * (1.4 mem accesses / instruction) * (0.01 L2 missrate) * (64 bytes needed / miss) = 1.075 bytes required from the main memory / cycle => 2.5 GB/sec memory bandwidth needed at 2.25 GHz
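The two estimates above can be reproduced in a few lines. This is only a sketch of the same back-of-envelope model, plugging in the same assumed values (the final GB/sec figures depend on rounding and on binary vs. decimal gigabytes):

```python
def mem_bw(ipc, mem_refs_per_instr, l2_miss_rate, line_bytes, clock_hz):
    """Bytes fetched from main memory per cycle, and per second,
    under the simple model: IPC * refs/instr * miss rate * line size."""
    bytes_per_cycle = ipc * mem_refs_per_instr * l2_miss_rate * line_bytes
    return bytes_per_cycle, bytes_per_cycle * clock_hz

# P4 at 3 GHz: 128-byte L2 lines, ~0.7% global miss rate, ~1 IPC
p4_bpc, p4_bw = mem_bw(1.0, 1.4, 0.007, 128, 3.0e9)
# Athlon at 2.25 GHz: 64-byte L2 lines, ~1% global miss rate, ~1.2 IPC
k7_bpc, k7_bw = mem_bw(1.2, 1.4, 0.01, 64, 2.25e9)

print(f"P4:     {p4_bpc:.2f} bytes/cycle, {p4_bw / 2**30:.2f} GB/s")
print(f"Athlon: {k7_bpc:.2f} bytes/cycle, {k7_bw / 2**30:.2f} GB/s")
```

The bytes/cycle figures come out to 1.25 and 1.08, matching the working above; note how the P4's larger line size and lower miss rate partially cancel.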

>A longer pipeline potentially increases the miss-prediction penalty ...
>but given two microprocessors with the same miss-prediction penalty
>and the same clock rate, pipeline length will not affect main memory bandwidth.
A longer pipeline in itself increases the misprediction penalty. The P4 does not reduce misprediction below the Athlon to undo it, probably because there is no way to do so. The way to make up for it is a higher clock. A higher clock needs a higher memory bandwidth.
Thank you, this is exactly one of the parameters that affect main memory bandwidth that I have been describing.

> I'm a graduate student researching computer architecture.
Not exactly a surprise. It is always a pleasure to see posts from very knowledgeable people such as yourself.

>these are not wild concepts but rather basic conclusions of memory systems.
I guess when people take issue with what we have said, we are offended. My apologies. If I could phrase things just right, maybe I could avoid offending people, although not entirely I am sure. I am always surprised at how people take things.
Okay......this seems to be one of the first times I've actually received such a negative reaction to merely explaining the results of decades of research and design in computer architecture. I'd like to know what the source of this antagonism towards my area of study is.

Sohcan: are you sure P4 cache lines are 128 bytes? I thought all x86 processors since the Pentium had 64 byte cache lines, exactly matching memory burst length. This alone would in fact dramatically increase the bandwidth dependency if true.
Yep, page 45. Having the DRAM burst length equal to the block size is somewhat beneficial for main memory latency, though the drawback of a block size larger than the DRAM burst length is not nearly as dramatic as the ratio between them would suggest, due to memory controller concurrency. In fact, RDRAM and DDR-II are both spec'd to have 32-byte burst lengths.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Didn't you read the disclaimer when you registered for the forums? Paragraph 2, sentence 5: "Anandtech and its forum members are not responsible for any damage, either physical or psychological, that you receive while participating in the forum."
 

KF

Golden Member
Dec 3, 1999
1,371
0
0
>If the Athlon decoded, on average, an x86 instruction to 3 uops, then
> its 3 decoders would decode, on average, 9 uops/cycle. This would easily
> fill up its decode queue.
My original complaint was that your logic did not support your conclusion (3 ops overflowing enough positions for 6). Since I didn't recall much about the Athlon's architecture, I took a look at it again (in the Athlon code optimization guide, 22007.pdf, appendix A). Suffice it to say that the Athlon's queues are large enough that 9 micro-ops per cycle would not fill them up, or stall. I don't know where you got your numbers from; maybe the K6.

"...A DirectPath instruction is limited to those x86
instructions that can be further decoded into one or two OPs...."

The Athlon repacks x86 instructions into MacroOPs that may contain 2 ops. (or possibly 3 in the next case?)

"Uncommon x86 instructions requiring two or more MacroOPs
proceed down the VectorPath pipeline...."

"The ICU takes the three MacroOPs per cycle from the early
decoders and places them in a centralized, fixed-issue reorder
buffer. This buffer is organized into 24 lines of three MacroOPs
each."

24 lines x 3 MacroOPs x 2 OPs = 144 OPs. So there is room for 144 ops (not 6) without overflowing or stalling. (Typical of Athlon overkill.)

shr's original statement, which you said was impossible, was " in average the Athlon gets 3 RISC intructions per CISC instruction... "
Now you are using 9 ops to obtain an impossibility, but that still would not "easily fill up its decode queue." If you are going to obfuscate with obscure technical references in order to browbeat people into silence, you are going to have to be far more vague. Do not ever refer to something which can be checked.

>The fact that you don't know the authors doesn't lead much credibility

The fact that you don't know the Athlon doesn't lend much credibility


>Okay......this seems to be one of the first times I've actually received such a
> negative reaction to merely explaining the results of decades
> of research and design in computer architecture. I'd like to know what the
> source of this antagonism towards my area of study is.
Since I presented no antagonism to "explaining the results of decades of research", there is no source of it for me to explain. My "antagonism" is that, directly from the premises you present (4 to 1 cache ratio), it is inapplicable. The Athlon's cache is 2 to 1 (256K to 128K).

Sohcan: "The fact is that as long as the L2 is at least four to eight times the size of the L1, ..."

>Do you take pride in the fact that the Athlon "stomps the cr*p" out of the VAX-11? The MIPS R2000 is nearly two decades old.
I guess you prefer to miss that point too. The theories and implementations from decades ago are primitive and have been superseded by what goes into today's processors. To say the Athlon's cache is "suitable for public relations", and to use an outdated computer to back the claim that a 4 to 1 L1/L2 cache ratio is required for optimal performance, is absurd. In every way the resources on the Athlon die are overkill.

> And I'd still like to hear some explanation why Patterson's publishings are incorrect, ...
Since the idea that anyone's publishings are incorrect was not presented by me, you will have to explain why they are incorrect, if they are, to me.

>.. especially considering many of the architects for the K7 likely researched under him.
> You seem to be holding some grudge over research, which is quite humorous
> since senior architects at microprocessor companies tend to come from academia, and vice versa.

Why do I seem to you to be holding a grudge against research? I stated that research was the reason for the Athlon's and P4's final characteristics. It is true the research was done on _real_ processors running _real_ programs, and the implementations had to pass the rigors of _really_ having to perform, compete and sell, but it was _real_ research. Why do you have a grudge against _real_ research?


First I got this:
>Sohcan: No, pipeline length alone has nothing to do with main memory bandwidth..

Then this:
>>KF: A longer pipeline in itself increases the misprediction penalty...
>Sohcan: Thank you, this is exactly one of the parameters that affect main memory bandwidth that I have been describing.

In between, there were extended outbursts of interspersed gobbledegook and brow-beating, as appropriate for a graduate student.

Oh yeah. I am partial to the Athlon. It stomped the corresponding Intel chips in my price range, so that is what I bought. (For a while before that I bought Intel Celerons, when they were very slightly crippled and under-clocked PIIs.) IAC, I've looked over a lot more data on the Athlon because it doesn't make any sense to buy a P4. Although the architectural info available from AMD for the Athlon is more revealing than Intel's was for the Celeron-PIIs, it is just as impossible to predict what an Athlon will do in practice from it. That is to be expected with anything whose component parts interact in such a complex way. In general, Intel designs their chips in a less robust way, so that it easier to run into a programming pitfall. It does not matter much for Intel because programmers will adjust to anything Intel gives them. AMD can't get away with that.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: KF
>If the Athlon decoded, on average, an x86 instruction to 3 uops, then
> its 3 decoders would decode, on average, 9 uops/cycle. This would easily
> fill up its decode queue.
My original complaint was that your logic did not support your conclusion (3 ops overflowing enough positions for 6). Since I didn't recall much about the Athlon's architecture, I took a look at it again (in the Athlon code optimization guide, 22007.pdf, appendix A). Suffice it to say that the Athlon's queues are large enough that 9 micro-ops per cycle would not fill them up, or stall. I don't know where you got your numbers from; maybe the K6.

"...A DirectPath instruction is limited to those x86
instructions that can be further decoded into one or two OPs...."

The Athlon repacks x86 instructions into MacroOPs that may contain 2 ops. (or possibly 3 in the next case?)

"Uncommon x86 instructions requiring two or more MacroOPs
proceed down the VectorPath pipeline...."

"The ICU takes the three MacroOPs per cycle from the early
decoders and places them in a centralized, fixed-issue reorder
buffer. This buffer is organized into 24 lines of three MacroOPs
each."

24 lines x 3 MacroOPs x 2 OPs = 144 OPs. So there is room for 144 ops (not 6) without overflowing or stalling. (Typical of Athlon overkill.)

shr's original statement, which you said was impossible, was " in average the Athlon gets 3 RISC intructions per CISC instruction... "
Now you are using 9 ops to obtain an impossibility, but that still would not "easily fill up its decode queue." If you are going to obfuscate with obscure technical references in order to browbeat people into silence, you are going to have to be far more vague. Do not ever refer to something which can be checked.

>The fact that you don't know the authors doesn't lead much credibility

The fact that you don't know the Athlon doesn't lend much credibility
Close, but no cigar. What you were describing is the reorder buffer, which manages in-order instruction retirement and maintains precise exceptions under speculation and out-of-order execution. Entries in the reorder buffer (which has 72 entries, not 144, if you read the next sentence after the one you quoted) are allocated at instruction issue and deallocated at instruction retirement.

What I was referring to earlier was the pathway from decode up to instruction issue, which occurs earlier in the pipeline. If you look on page 207 (227 according to acroread), it clearly indicates that the path from the hardwired control ("DirectPath decoder") has a 6-entry queue: "up to six DirectPath x86 instructions can be passed into the DirectPath decode pipeline." This is not a problem, since in the common case the 3 "DirectPath" decoders each produce 1 or 2 macro-ops each cycle and the microcoded control (for those x86 instructions that decode to more than 2 macro-ops) produces a maximum of 3 macro-ops per cycle. This verifies what I said earlier, in which the 6-entry decode queue would stall fetch if each hardwired decoder was producing more than 2 macro-ops per cycle on average.

In fact the section you quoted further illustrates that the vast majority of x86 instructions on the Athlon are decoded into 1 or 2 macro-ops, not an average of 5 as was previously indicated. The first paragraph under "DirectPath Decoder" clearly indicates that each of the three hardwired decoders handles those instructions that decode into 1 or 2 macro-ops. The "VectorPath decoder", or the microprogrammed control, handles those that decode to more than 2 macro-ops: "Uncommon x86 instructions requiring two or more MacroOps proceed down the VectorPath pipeline". Microprogrammed control is SLOW, and was one of the first techniques to be axed in the 80s with RISC instruction sets, with x86 MPUs only keeping it around for compatibility with uncommon, complex instructions. Rest assured that if the P3 and Athlon ran out of microcode, which would happen if the average x86 instruction decoded into 5 macro-ops, they would be as slow as molasses. Fortunately, in the vastly common case decode is handled by the hardwired control.
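A toy throughput model makes the point about the common case concrete. The 3-wide DirectPath decode, its 1-2 macro-op limit, and the 3-macro-ops-per-cycle microcode rate are taken from the discussion above; everything else is a deliberate simplification, not a model of the real pipeline:

```python
def decode_cycles(instr_macroops, directpath_width=3, vectorpath_rate=3):
    """Toy model: cycles to decode a list of x86 instructions, given
    how many macro-ops each produces. Instructions with <= 2 macro-ops
    go up to 3 at a time through the DirectPath decoders; anything
    larger is serialized through the VectorPath (microcode) at
    `vectorpath_rate` macro-ops per cycle."""
    cycles = 0
    i = 0
    while i < len(instr_macroops):
        if instr_macroops[i] <= 2:
            # pack up to 3 consecutive DirectPath instructions per cycle
            n = 0
            while (i < len(instr_macroops) and n < directpath_width
                   and instr_macroops[i] <= 2):
                i += 1
                n += 1
            cycles += 1
        else:
            # VectorPath: microcode serializes this one instruction
            cycles += -(-instr_macroops[i] // vectorpath_rate)  # ceil div
            i += 1
    return cycles

common = [1, 2, 1, 2, 1, 2]    # typical x86 mix: 1-2 macro-ops each
complex_ = [5, 5, 5, 5, 5, 5]  # hypothetical: 5 macro-ops each
print(decode_cycles(common), decode_cycles(complex_))
# DirectPath case finishes in 2 cycles; the all-microcode case takes 12
```

Even this crude sketch shows a 6x slowdown if every instruction really averaged 5 macro-ops, which is why the hardwired common case matters.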

Now that we have that issue settled....


>Okay......this seems to be one of the first times I've actually received such a
> negative reaction to merely explaining the results of decades
> of research and design in computer architecture. I'd like to know what the
> source of this antagonism towards my area of study is.
Since I presented no antagonism to "explaining the results of decades of research", there is no source of it for me to explain. My "antagonism" is that, directly from the premises you present (4 to 1 cache ratio), it is inapplicable. The Athlon's cache is 2 to 1 (256K to 128K).
I was applying it to the P4 to demonstrate that its L2 likely has a 30-40% lower global miss-rate, due to the previously mentioned L1->L2 ratio behavior and its larger cache size, somewhat offset by the Athlon's wider associativity (16-way vs. 8-way on the P4's L2). This is more than reasonable given that Paul DeMone estimated the Athlon's and P4's L2s at 256KB to have nearly equal miss-rates (1% and 1.1%, respectively) using the well-known VAX study on cache miss-rates.

I'm dreadfully sorry about not clearing this up earlier, maybe if you hadn't flown off the handle it could have been resolved.


>Do you take pride in the fact that the Athlon "stomps the cr*p" out of the VAX-11? The MIPS R2000 is nearly two decades old.
I guess you prefer to miss that point too. The theories and implementations from decades ago are primitive and have been superseded by what goes into today's processors.
Is that your professional opinion? Universally branding theories and implementations that are decades old as primitive is quite ludicrous. Tomasulo's dynamic scheduling algorithm is at the heart of nearly every high-performance microprocessor today, save the Sun US-III and Itanium family, yet it was developed 35 years ago for the IBM 360/91. Most of the techniques and understanding that go into caches and TLBs are decades old. The structures that support speculative execution are nearly two decades old.

To say the Athlon's cache is "suitable for public relations", and to use an outdated computer to back the claim that a 4 to 1 L1/L2 cache ratio is required for optimal performance, is absurd. In every way the resources on the Athlon die are overkill.
Wow, what reading comprehension you've displayed there. Let's look at what I said again: "Stating that the Athlon has "384 KB of cache" may be suitable for public relations, but summing up the two-level cache hierarchy into such a neat little figure is flawed when analyzing cache performance." I'd like to know where I said that "the Athlon's cache is suitable for public relations." It was quite clear that I was attacking the commonly held misconception about exclusivity, not the principle or the Athlon's cache. As for "a 4 to 1 L1/L2 cache ratio is required for optimal performance", I NEVER stated that. I was using the principle as part of the cache analysis (in my first post) and to demonstrate the necessity of exclusivity in the Athlon's L2 to prevent the L1/L2 ratio from effectively being 1:1 (in my second post). Maybe if you had asked me to clarify myself instead of attacking me, this misunderstanding would not have occurred.

> And I'd still like to hear some explanation why Patterson's publishings are incorrect, ...
Since the idea that anyone's publishings are incorrect was not presented by me, you will have to explain why they are incorrect, if they are, to me.
You attacked the credibility of the authors on two occasions, which I've noticed is an often-used tactic when one wants to refute a finding or claim but cannot find any technical reason:

"I have no doubt the designers of the Athlon know infinitely more about designing a real CPU than whoever these authors may be"

"The designers of the Athlon are also fathers, fathers of a real microprocessor, one which no doubt stomps the cr*p out of a MIPS R2000."

>.. especially considering many of the architects for the K7 likely researched under him.
> You seem to be holding some grudge over research, which is quite humorous
> since senior architects at microprocessor companies tend to come from academia, and vice versa.

Why do I seem to you to be holding a grudge against research?
Let's see...in reference to Horowitz, Hennessy, and Przybylski: "Real designers also do extensive statistics on real programs to check." The implication that researchers do not extensively test ideas on real programs is quite false. There are also the previously mentioned attacks on the authors.

I stated that research was the reason for the Athlon's and P4's final characteristics. It is true the research was done on _real_ processors running _real_ programs, and the implementations had to pass the rigors of _really_ having to perform, compete and sell, but it was _real_ research. Why do you have a grudge against _real_ research?
I have no such grudge; I do real research.


First I got this:
>Sohcan: No, pipeline length alone has nothing to do with main memory bandwidth..

Then this:
>>KF: A longer pipeline in itself increases the misprediction penalty...
>Sohcan: Thank you, this is exactly one of the parameters that affect main memory bandwidth that I have been describing.
Nice selective quoting; I was referring to clockspeed. While clockspeed and IPC affect main memory bandwidth requirements and are individually affected by pipeline length, it is still the case that pipeline length alone has little effect on main memory bandwidth. This is especially true in the case of the P4, for which in many cases IPC * clock rate for the fastest P4 is roughly equal to the same figure for the fastest Athlon.

In between, there were extended outbursts of interspersed gobbledegook and brow-beating, as appropriate for a graduate student.
Calling what you don't understand "gobbledegook" does not make it so. Given that I received a few pm's thanking me for the explanation, it seems others thought differently.

Oh yeah. I am partial to the Athlon.
Nothing wrong with that, I think that the Athlon is arguably the finest x86 microprocessor ever produced. Next time don't get out of hand by taking a criticism of PR junk related to the Athlon as a derision of the microprocessor itself.

Now that you've successfully taken most of the content of the thread off-topic, let me know when you would like to discuss the topic at hand; otherwise I'm done quarrelling. I would be happy to explain these concepts, I just don't appreciate being attacked without provocation.
 