AMD Barcelona Thoughts/Questions

Page 2 - AnandTech Forums

zsdersw

Lifer
Oct 29, 2003
10,505
2
0
I'm not talking about "an" increase... I'm talking about a significant increase, and I have been talking about significant increases (i.e. greater than 10%) all along.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Viditor
In what way do you think K10 is different with respect to L2 cache?
I should have made my post more clear: the L2 cache in K8 is only 64 bits wide, while in K10 it's 128 bits. The memory controller is also much improved over the one in K8, which not only helps K10 make better use of faster memory, but I'd say also allows the CPU to make better use of larger caches.

I like this article at xbitlabs about K10: xbitlabs

That's a very difficult question because there are so many other variables (different distros for example). However, I have never heard of Linux hitting 4 issues per clock either, except in some very rare (almost purpose-built) apps...that said, I'm not incredibly familiar with Linux, so maybe someone else can answer this?

From reading your post and other posts in this thread, I guess the difference between 4-issue and 3-issue would not be that huge, since the 4th issue slot is mostly unused. Maybe that explains why AMD didn't implement it in K10.

Originally posted by: Nemesis 1
One must keep in mind that while Intel was working on the Merom core, future cores were also being worked on. Even though the 4th issue slot seldom gets a hit, Intel created it with the future in mind. So when we do see Nehalem with its completely new type of HT (Hyperthreading), I see 4-issue being hit a lot.
So you could say Intel added the 4-issue core to perfect the logic on the chip for upcoming releases. Does this make sense or not?

Good post, never thought of it that way

Originally posted by: jones377
It has more to do with the code itself than the decoders, although they do play a part. The IPC on most integer code is only around 1.0 on average. But even on such code, having a 4th decoder can improve performance a bit, by a few percent.

I agree with your post, but maybe as Nemesis mentioned, future code will make better use of this, so it may be good to have more decoders in a CPU for the slight performance increase now, and for a bigger increase in the future.

I would like to add one more thing about CPU scaling in Barcelona, since many websites said K10 starts to perform better and better as the frequency increases. I believe there is some truth to this, but the advantage is not as big as some people might think.

For example, from the AnandTech tests, a 25% frequency increase from 2GHz to 2.5GHz gave an average performance gain of 19% on those tests. I checked another website today testing a QX6850 (3GHz) and compared its results with a Q6600 (2.4GHz). Going from 2.4GHz to 3GHz is also a 25% increase, and the QX6850 gained about 17%. So that's only a 2-point difference for every 25% gain in frequency.

Of course these are not accurate numbers, as we'll have to wait for more tests on K10 with more stable hardware. From those preliminary results, I'll give an example. Let's say Phenom at 2GHz will be 4% slower than Penryn, also at 2GHz. When both are running at 2.5GHz the difference should only be 2%, and at 3.1-3.2GHz they should perform about equally. This is not taking into account better multithreading scaling on K10, so on highly multithreaded software, the K10 advantage would get a bit bigger.

These are rough numbers guys, so take them with a grain of salt. I'm still hoping Phenom will perform better than what I'm expecting, but right now I still think it will be 5-15% slower (at low frequencies) than Penryn clock for clock.
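The rough arithmetic above can be sketched as a toy model. Everything here is illustrative: the 19%/17% gains per 25% clock step come from my loose reading of the early reviews, and the 4% starting gap at 2GHz is just my guess, not a measurement.

```python
import math

def projected_perf(perf_at_2ghz, ghz, gain_per_25pct_clock):
    """Project performance from a 2GHz baseline, assuming each 25% clock
    increase yields a fixed fractional performance gain."""
    # Express "X% gain per 25% clock" as an exponent on the clock ratio.
    exponent = math.log(1 + gain_per_25pct_clock) / math.log(1.25)
    return perf_at_2ghz * (ghz / 2.0) ** exponent

def phenom(ghz):   # assumed 4% behind at 2GHz, scaling 19% per 25% clock
    return projected_perf(0.96, ghz, 0.19)

def penryn(ghz):   # baseline 1.0 at 2GHz, scaling 17% per 25% clock
    return projected_perf(1.00, ghz, 0.17)

for ghz in (2.0, 2.5, 3.0, 3.2):
    gap = (penryn(ghz) - phenom(ghz)) / penryn(ghz)
    print(f"{ghz:.1f} GHz: Penryn ahead by {gap:.1%}")
```

Under these assumptions the gap shrinks from 4% at 2GHz to roughly 2% at 2.5GHz and near parity past 3GHz, which is the crossover described above.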
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: Kuzi
From reading your post and other posts in this thread, I guess the difference between 4-issue and 3-issue would not be that huge, since the 4th issue slot is mostly unused. Maybe that explains why AMD didn't implement it in K10.

Going from 3- to 4-issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (e.g. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.

 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: dmens
Originally posted by: Kuzi
From reading your post and other posts in this thread, I guess the difference between 4-issue and 3-issue would not be that huge, since the 4th issue slot is mostly unused. Maybe that explains why AMD didn't implement it in K10.

Going from 3- to 4-issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (e.g. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.

Welcome dmens! Haven't seen you in awhile...

Could you comment on this quote I was sent?

"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generates up to 4 micro-ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer numbers, K8's 3-way x86 decoding is even better than C2D's 4-way micro-op generation, but on average they should perform about the same"

Cheers
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: dmens
Going from 3- to 4-issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (e.g. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.

That's what I was thinking too when I made my first post here; it seems C2D may have a big advantage in integer tests compared to K10, and I thought 4-issue is what gives C2D the edge.

So now, as I understand from all the posts here, in order to utilize 4-issue you'll need:
a) Hardware that supports and is ready to utilize it.
b) An OS that supports it.
c) Code that can take advantage of it.

If going from 3- to 4-issue does give, say, a 10%+ increase, then I'd say AMD should definitely implement it in Shanghai.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Kuzi
Originally posted by: dmens
Going from 3- to 4-issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (e.g. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.

That's what I was thinking too when I made my first post here; it seems C2D may have a big advantage in integer tests compared to K10, and I thought 4-issue is what gives C2D the edge.

So now, as I understand from all the posts here, in order to utilize 4-issue you'll need:
a) Hardware that supports and is ready to utilize it.
b) An OS that supports it.
c) Code that can take advantage of it.

If going from 3- to 4-issue does give, say, a 10%+ increase, then I'd say AMD should definitely implement it in Shanghai.

One other point to check on, Kuzi, is what I'm asking dmens about (he definitely knows his stuff...).
Basically it's comparing a 4-wide issue of micro-ops vs a 3-wide issue of full x86 instructions...
This is over my head, but hopefully dmens can shed some more light.
I do have a link with hints in it...
Link
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: Viditor
One other point to check on, Kuzi, is what I'm asking dmens about (he definitely knows his stuff...).
Basically it's comparing a 4-wide issue of micro-ops vs a 3-wide issue of full x86 instructions...
This is over my head, but hopefully dmens can shed some more light.
I do have a link with hints in it...
Link

Thanks Viditor, I didn't comment about micro-ops or full x86 instructions because I really have no idea what the difference between the two is. I'll go check your link right now.

 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: Viditor
Welcome dmens! Haven't seen you in awhile...

Could you comment on this quote I was sent?

"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generates up to 4 micro-ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer numbers, K8's 3-way x86 decoding is even better than C2D's 4-way micro-op generation, but on average they should perform about the same"

Cheers

yes i am referring to different metrics. c2d's "4-issue" can be interpreted as the width of the ooo and portions of the front end. the decoder arrangement refers to the throughput of the x86 decoders.

afaik k10 is still 3 micro-ops wide after the decoder, so even if the decoder can process more bytes of code than a c2d, it still has a lower micro-op throughput at that particular point in the machine. but if k10 requires less uops than c2d to do the same work, then the difference is leveled out somewhat. also depends on workload. the behavior is quite different, so i dont want to compare.
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: Kuzi
That's what I was thinking too when I made my first post here; it seems C2D may have a big advantage in integer tests compared to K10, and I thought 4-issue is what gives C2D the edge.

So now, as I understand from all the posts here, in order to utilize 4-issue you'll need:
a) Hardware that supports and is ready to utilize it.
b) An OS that supports it.
c) Code that can take advantage of it.

If going from 3- to 4-issue does give, say, a 10%+ increase, then I'd say AMD should definitely implement it in Shanghai.

no the change is purely in hardware.

should and could are different unfortunately. the effort might be prohibitive and frequency could potentially take a heavy hit. depends on schedule really.
 

OneEng

Senior member
Oct 25, 1999
585
0
0
On the question of why AMD chose to use an L3 cache:

The L3 (unlike the L2) is a semi-exclusive "victim" cache and is shared. This is nearly exclusively a multi-threaded advantage. For single threaded applications, a larger L2 would have been much better.

The L3 being shared allows the snooping traffic that exists between sockets to be greatly reduced since the 4 cores on each socket have a central shared area to snoop from that doesn't clog up the HT link.
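As a rough illustration of the victim-cache idea (a hypothetical toy model, not AMD's actual replacement policy): lines evicted from a smaller cache are installed in the victim cache instead of being thrown away, so a later miss can still be serviced without going all the way to memory.

```python
from collections import OrderedDict

class Cache:
    """Tiny LRU cache. If `victim` is given, evicted lines fall into it
    (the shared-L3-as-victim-cache arrangement described above)."""
    def __init__(self, capacity, victim=None):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> data, oldest first
        self.victim = victim

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)   # mark as most recently used
            return self.lines[addr]
        return None                        # miss

    def fill(self, addr, data):
        if len(self.lines) >= self.capacity:
            old_addr, old_data = self.lines.popitem(last=False)  # evict LRU
            if self.victim is not None:
                self.victim.fill(old_addr, old_data)  # drop into the L3
        self.lines[addr] = data

l3 = Cache(capacity=4)                 # shared victim cache
l2 = Cache(capacity=2, victim=l3)      # private per-core cache

l2.fill(0x100, "A")
l2.fill(0x200, "B")
l2.fill(0x300, "C")   # 0x100 is evicted from L2 and lands in L3
print(l2.lookup(0x100), l3.lookup(0x100))   # → None A
```

In the real chip the L3 is only semi-exclusive and is shared by all four cores, which is what cuts down the cross-socket snoop traffic; the sketch only shows the eviction path.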

On the issue of Core 2's L2:

Of course Core 2 gains big time when more L2 is added. ANYTHING to avoid hitting that preposterously slow FSB memory access. As pointed out, this is exactly why Intel keeps putting more and more L2 on their chips.

Intel's L2 also has the advantage of being shared..... which gives the same benefits as I stated above for AMD's quad core, but only for 2 cores at a time for Intel.

Intel also has a north bridge chip which helps eliminate the snooping between sockets.

I think that Phenom may surprise many people. It should accelerate in performance as memory speeds and clock speeds are increased. Someone pointed out that K8 didn't ..... well boys and girls, K10 is most surely NOT K8.

K10 has double the number of cores, and EACH of the cores can consume two times the bandwidth of a K8 core. This translates into a four times higher "hunger" for memory bandwidth, and a greater sensitivity to latency, than K8.

I also think that Phenom is going to benefit from new motherboards having HT3 enabled. I suspect that Barcelona was supposed to have launched with this feature as well and that HT1 is holding Barcelona back in some respects.

The 3D rendering scores in particular make no sense at all. K10 should eat up and spit out K8 core for core with this kind of a work load ..... but it doesn't. Something is quite obviously holding K10 back.

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
Originally posted by: Viditor
Welcome dmens! Haven't seen you in awhile...

Could you comment on this quote I was sent?

"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generates up to 4 micro-ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer numbers, K8's 3-way x86 decoding is even better than C2D's 4-way micro-op generation, but on average they should perform about the same"

Cheers

yes i am referring to different metrics. c2d's "4-issue" can be interpreted as the width of the ooo and portions of the front end. the decoder arrangement refers to the throughput of the x86 decoders.

afaik k10 is still 3 micro-ops wide after the decoder, so even if the decoder can process more bytes of code than a c2d, it still has a lower micro-op throughput at that particular point in the machine. but if k10 requires less uops than c2d to do the same work, then the difference is leveled out somewhat. also depends on workload. the behavior is quite different, so i dont want to compare.

K7/K8/Barcelona aren't 3 uops wide if you want to use Intel-speak. In Intel-speak, they're up to 9 wide (maybe even 12? Did I really count that properly???) depending on whether you're talking pre-Banias-Intel or post-Banias-Intel and exactly how you count. On PPro through P3, "add [eax], ebx" ("add the value in ebx to the value at the memory location pointed to by eax") is 4 uops (load, add, store-address, store-data). With Pentium M, uop fusion fixed the (very strange) design in which stores take 2 uops, and the store-address and store-data operations are tracked as a single uop through the "fused" parts of the logic (in particular, decode/retire), so that instruction turns into 3 uops. Core2 improves this further, and that instruction decodes to 2 uops in the fused domain.

On K7/K8/Barcelona, that instruction gets tracked as one unit, and the machine can decode/retire 3 of those units per cycle. K7 could retire 3 x86 instructions in a cycle even if those 3 x86 instructions performed a memory read, a calculation, and a memory write (e.g. the "add" instruction above), which puts it at 9 wide in pre-Banias Intel-speak. K8 works the same way, and as far as I know Barcelona does too. I'm not sure if the data cache can actually sustain more than 2 operations per cycle or the integer execution units can all execute simultaneously... that's probably documented somewhere. Regardless of execution, you could probably get the 3 instructions to retire simultaneously by preceding them with a long-latency operation like a multiplication. If I'm counting right, that'd give you 12 PPro/P2/P3 uops or 6 Core2 uops retiring in a single cycle.
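The counts in the two paragraphs above can be tabulated; Python here is used purely as a calculator, and the numbers are the ones stated in this post for the read-modify-write "add [eax], ebx" example:

```python
# How many tracked ops "add [eax], ebx" costs in each design's own
# bookkeeping, per the description above.
uops_for_rmw_add = {
    "PPro/P2/P3": 4,        # load, add, store-address, store-data
    "Pentium M": 3,         # store-address/store-data fused into one uop
    "Core 2 (fused)": 2,    # load fused with the add as well
    "K7/K8/Barcelona": 1,   # the whole instruction is one macro-op
}

RETIRE_WIDTH = 3  # each machine retires up to 3 tracked ops per cycle

for chip, uops in uops_for_rmw_add.items():
    insns_per_cycle = RETIRE_WIDTH / uops
    # Same work expressed in pre-Banias (PPro-style) uops per cycle:
    ppro_uops_per_cycle = insns_per_cycle * 4
    print(f"{chip}: {insns_per_cycle:.2f} such insns/cycle "
          f"= {ppro_uops_per_cycle:.0f} PPro-style uops of work")
```

The K7/K8/Barcelona row reproduces the "12 PPro-style uops retiring in a single cycle" figure above. Of course this only compares the retirement bookkeeping, not what the rest of the pipeline can actually sustain.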

Now, regarding decoding, the K7/K8/Barcelona decoders can all decode somewhat complex instructions. (There are some subtleties that I'll ignore here, partly because I'm not 100% sure about them and partly because I can't find published documentation on them, so I don't know what the right words are.) The Intel processors have only one decoder that handles complex instructions; the rest can only handle instructions that decode to a single uop. If you have a sequence of instructions that each require 2 uops on the Intel chips, you'll be stuck with a throughput of a single instruction per cycle. Of course, with Pentium M / Core / Core2, Intel has expanded the set of instructions that fall into this group, so the relative penalty has shrunk a lot, but personally I find it impressive that P3 could compete with Athlon given some of the limitations it had.

I highly recommend reading Agner Fog's The microarchitecture of Intel and AMD CPUs: An optimization guide for assembly programmers and compiler makers, and also checking out his instruction tables for uop count details (the 4th document there).
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generates up to 4 micro-ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer numbers, K8's 3-way x86 decoding is even better than C2D's 4-way micro-op generation, but on average they should perform about the same"

The story behind Intel's 4-issue and K8's 3-issue is more complicated than you guys realize. Back in the P6 architecture days of the Pentium Pro/II/III, when the architecture didn't change, K8 had a theoretical advantage in issue bandwidth.

Yonah had decoder enhancements from Dothan:
http://www.digit-life.com/arti...2/cpu/intel-yonah.html

These changes are quite fragmentary and practically don't concern functional units. They include:

* Micro-ops fusion for SSE instructions of all types (SSE/SSE2/SSE3)
* SSE instructions are now handled by all three decoders
* SSE3 instruction set
* Faster execution of some SSE2 instructions as well as integer divide
* Enhanced data prefetch

Now, since technical articles about Core 2 Duo state that even more x86 instructions can use the simple decoders, the difference between Intel's and AMD's decoders is probably irrelevant.

Whatever it is, Core 2 Duo is a faster CPU than Barcelona per clock in personal computers. Because Anand used dual-socket Opterons to test Barcelona, C2D probably really has a 10-15% advantage in single-threaded apps.
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
so the amd procs track the macro-op, cool that is an interesting difference to note. but it isn't really accurate to say those machines are 6/12 uops wide because the frontend to rename portion of the pipeline is still 3 uops, so even if retirement is capable of taking care of more, the bottleneck is still up front.

another example is that both c2d and k8 are capable of executing more uops than fed to the schedulers per cycle, but in c2d's case, it is not accurate to say the machine is "6-issue".
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
so the amd procs track the macro-op, cool that is an interesting difference to note. but it isn't really accurate to say those machines are 6/12 uops wide because the frontend to rename portion of the pipeline is still 3 uops, so even if retirement is capable of taking care of more, the bottleneck is still up front.

No it's not. As far as I can tell (I'm only looking at published / 3rd party documentation) you could really decode 3 instructions like "MOV [EAX+EBX],ECX" each cycle. It wouldn't be hard to write a loop to test that experimentally... we'll see whether I choose to write it and test it or go to sleep on time.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
I think that Phenom may surprise many people. It should accelerate in performance as memory speeds and clock speeds are increased. Someone pointed out that K8 didn't ..... well boys and girls, K10 is most surely NOT K8.

Take a look at Anandtech's QX6850 article and tell me whether Barcelona indeed does scale better.

The problem with Anand's tests, which screws with everybody's reading of them, is that the Barcelona tests used 1024x768 to show "CPU bound" performance, while the Core 2 Duo tests did not; they used 1600x1200. Tom's Hardware showed the pure CPU advantage Core 2 Duo had over Athlon by using 1024x768 and a GeForce 8800GTX. It doesn't show up at 1600x1200.

You can see in: http://www.anandtech.com/cpuch...owdoc.aspx?i=3038&p=15

that Core 2 Duos over 2.66GHz are too GPU bound to really show performance differences. You can see excellent scaling by comparing the 2.33GHz part to the 2.66GHz part.

Oblivion: 83.2%
HL2: 67.3%

By looking at those two points alone, it's better than Barcelona's scaling from 2.0 to 2.5GHz:
Oblivion: 66.4%
HL2: 65.6%

In fact Core 2 Duo scales better over this limited test range, and even at a higher resolution, unlike Barcelona's "CPU bound" resolution. In WME the scaling is even superlinear, at 105% by the data; iTunes is at 95.2%.

By comparison Barcelona is at 93.6% and 58% respectively.

So can we draw conclusions from Anand's Barcelona scaling benchmarks? Absolutely not. Until the final hardware comes, Barcelona only demonstrably scales better in multi-threaded benchmarks.
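The scaling percentages quoted above are just (relative performance gain) divided by (relative clock gain). A small helper makes the formula explicit; the scores below are made up for illustration, since the raw benchmark numbers aren't in this thread.

```python
def scaling_efficiency(perf_lo, perf_hi, clk_lo, clk_hi):
    """Fraction of a clock-speed increase that shows up as performance."""
    return (perf_hi / perf_lo - 1.0) / (clk_hi / clk_lo - 1.0)

# Perfect scaling: +25% clock gives +25% performance.
print(f"{scaling_efficiency(100.0, 125.0, 2.0, 2.5):.1%}")   # → 100.0%

# Hypothetical scores for the 2.33 -> 2.66 GHz step; real Oblivion
# numbers in that shape would yield the ~83% figure quoted above.
print(f"{scaling_efficiency(100.0, 111.8, 2.33, 2.66):.1%}")
```

Anything over 100% (like the WME result) is superlinear, usually a sign that something other than raw clock speed, such as cache or memory behavior, changed between the two data points.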
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
CTho9305 and dmens...I really appreciate the dialogue, but could you "dumb this down" for the rest of us?
BTW, thanks very much for the input!
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: IntelUser2000
I think that Phenom may surprise many people. It should accelerate in performance as memory speeds and clock speeds are increased. Someone pointed out that K8 didn't ..... well boys and girls, K10 is most surely NOT K8.

Take a look at Anandtech's QX6850 article and tell me whether Barcelona indeed does scale better.

The problem with Anand's tests, which screws with everybody's reading of them, is that the Barcelona tests used 1024x768 to show "CPU bound" performance, while the Core 2 Duo tests did not; they used 1600x1200. Tom's Hardware showed the pure CPU advantage Core 2 Duo had over Athlon by using 1024x768 and a GeForce 8800GTX. It doesn't show up at 1600x1200.

You can see in: http://www.anandtech.com/cpuch...owdoc.aspx?i=3038&p=15

that Core 2 Duos over 2.66GHz are too GPU bound to really show performance differences. You can see excellent scaling by comparing the 2.33GHz part to the 2.66GHz part.

Oblivion: 83.2%
HL2: 67.3%

By looking at those two points alone, it's better than Barcelona's scaling from 2.0 to 2.5GHz:
Oblivion: 66.4%
HL2: 65.6%

In fact Core 2 Duo scales better over this limited test range, and even at a higher resolution, unlike Barcelona's "CPU bound" resolution. In WME the scaling is even superlinear, at 105% by the data; iTunes is at 95.2%.

By comparison Barcelona is at 93.6% and 58% respectively.

So can we draw conclusions from Anand's Barcelona scaling benchmarks? Absolutely not. Until the final hardware comes, Barcelona only demonstrably scales better in multi-threaded benchmarks.

Just wanted to add a few things here...

Firstly, it depends on what you mean by scaling...my impression is that you are speaking of performance vs clockspeed. There's also the question of how well it scales with performance vs #cores.
I agree with you that we can't know the answer to the former at all yet, if for no other reason than that the 2.0 GHz and the 2.5 GHz parts were 2 different steppings (not to mention, as you say, that they really aren't the shipping systems he tested). As to performance vs #cores (which will especially affect Tigerton), according to previous benches from AT even the K8 scales better (though this was on older Clovertowns).

"Here is a first indication that quad core Xeon does not scale as well as the other systems. Two 2.4GHz Opteron 880 processors are as fast as one Xeon 5345, but four Opterons outperform the dual quad core Xeon by 16%. In other words, the quad Opteron system scales 31% better than the Xeon system"

AT Clovertown Review
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: Viditor
As to performance vs #cores (which will especially affect Tigerton), according to previous benches from AT even the K8 scales better (though this was on older Clovertowns).

perhaps tigerton will be less susceptible to diminishing gains because the caneland platform is better equipped on memory bandwidth than bensley or whatever it's called.

brute force that mofo.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: dmens
Originally posted by: Viditor
As to performance vs #cores (which will especially affect Tigerton), according to previous benches from AT even the K8 scales better (though this was on older Clovertowns).

perhaps tigerton will be less susceptible to diminishing gains because the caneland platform is better equipped on memory bandwidth than bensley or whatever it's called.

brute force that mofo.

he-he...sounds like a great marketing campaign, "Brute Force your Mofo Mobo".

I agree that Caneland should do better than Bensley...but I don't think it will do much better because they still have the bottleneck of the MCH to deal with.
BTW, there is a good article over at Arstechnica that does a compare and contrast on the platforms...
Arstechnica article
 

zsdersw

Lifer
Oct 29, 2003
10,505
2
0
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.

Then why do most applications not show "big time" gains from additional cache?

 

jones377

Senior member
May 2, 2004
451
47
91
Originally posted by: zsdersw
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.

Then why do most applications not show "big time" gains from additional cache?

Because you are trying to define the term "big time"?
 

zsdersw

Lifer
Oct 29, 2003
10,505
2
0
Originally posted by: jones377
Originally posted by: zsdersw
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.

Then why do most applications not show "big time" gains from additional cache?

Because you are trying to define the term "big time"?

I doubt most people would claim that less than 10% is a "big time" gain.
 

Keysplayr

Elite Member
Jan 16, 2003
21,209
50
91
Originally posted by: Viditor


You should read the post before replying...

From the OP:
I've seen Intel Core2 processors with 8MB L2 cache perform about 10-20% faster than the same CPU but with only 4MB L2 cache, and that is clock-for-clock

That would be a Core2Duo (4MB) compared to a Core2Quad (8MB). Not exactly a good direct comparison.

Here is a link to the Xbit E6420 review. Compare the E6400 (2.13GHz, 2MB) with the E6420 (2.13GHz, 4MB) directly. You'll see that in some benchmarks, cache has little to do with performance. In others, it helps a great deal. I'm not going to state percentage numbers here. I'll leave that to all you number crunchers.

Xbit 6420 review
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: keysplayr2003
That would be a Core2Duo (4MB) compared to a Core2Quad (8MB). Not exactly a good direct comparison.

Yes, you are right, but they were testing single- or dual-core performance, just to show how much the cache can help. I wish I remembered which site that was.
 