I should have made my post more clear; the L2 cache interface in K8 is only 64 bits wide, while in K10 it's 128 bits. Also, the memory controller is much improved compared to the one in K8, so that not only helps K10 make better use of faster memory, but I'd say it also allows the CPU to make better use of larger caches.
Originally posted by: Viditor
In what way do you think K10 is different with respect to L2 cache?
That's a very difficult question because there are so many other variables (different distros, for example). However, I have never heard of Linux hitting 4 issues per clock either, except in some very rare (almost purpose-built) apps... That said, I'm not incredibly familiar with Linux, so maybe someone else can answer this?
Originally posted by: Nemesis 1
One must keep in mind that while Intel was working on the Merom core, future cores were also being worked on. Even though the 4th issue slot seldom gets hit, Intel created it with the future in mind. So when we do see Nehalem with its completely new type of HT (Hyper-Threading), I see 4-issue being hit a lot.
So you could say Intel added the 4-issue core to perfect the logic on the chip for upcoming releases. Does this make sense or not?
Originally posted by: jones377
It has more to do with the code itself than the decoders, although they do play a part. The IPC on most integer code is only around 1.0 on average. But even on such code, having a 4th decoder can improve performance a bit, by a few percent.
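To put rough numbers on that: if you model each cycle as having some number of independent instructions ready, effective IPC is capped by both the code and the machine width. A minimal sketch, with a made-up ILP distribution (illustrative only, not measured from real code):

```python
# Toy model: effective IPC of a 3-wide vs 4-wide machine.
# The distribution below is invented for illustration; most integer
# code averages around 1 instruction per clock, as noted above.

# probability that N independent instructions are ready in a given cycle
ilp_distribution = {0: 0.35, 1: 0.30, 2: 0.20, 3: 0.10, 4: 0.04, 5: 0.01}

def effective_ipc(issue_width):
    # each cycle, at most issue_width of the ready instructions can issue
    return sum(p * min(n, issue_width) for n, p in ilp_distribution.items())

ipc3, ipc4 = effective_ipc(3), effective_ipc(4)
print(f"3-wide: {ipc3:.2f} IPC, 4-wide: {ipc4:.2f} IPC, "
      f"gain: {100 * (ipc4 / ipc3 - 1):.1f}%")   # ~4% with these numbers
```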
Originally posted by: Kuzi
From reading your post and other posts in this thread, I guess the difference between 4-issue and 3-issue would not be that huge, since the 4th issue slot is mostly unused. And maybe that explains why AMD didn't implement it in K10.
Originally posted by: dmens
Originally posted by: Kuzi
From reading your post and other posts in this thread, I guess the difference between 4-issue and 3-issue would not be that huge, since the 4th issue slot is mostly unused. And maybe that explains why AMD didn't implement it in K10.
3 to 4 issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (i.e. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.
Originally posted by: Kuzi
Originally posted by: dmens
3 to 4 issue will yield big returns if the entire machine is adjusted to utilize the additional throughput (i.e. C2D). My guess is that AMD decided it was not worth the effort on K10, as opposed to the change not giving enough back to warrant the effort, because it definitely does have tangible gains across the board.
That's what I was thinking too when I made my first post here, as it seems C2D may have a big advantage in integer tests compared to K10, and I thought the 4th issue slot is what's giving the edge to C2D.
So now, as I understand from all the posts here, in order to utilize 4-issue you'll need:
a) Hardware that supports it and is ready to utilize it.
b) An OS that has support for it.
c) Code that has support for it (see the sketch below).
If going from 3 to 4 issue does give, say, an increase of 10+%, then I'd say AMD should definitely implement it in Shanghai.
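On point (c): code doesn't need explicit "support" in the sense of recompilation, it just needs enough instruction-level parallelism to keep a 4th slot busy. A minimal scheduler sketch (hypothetical programs, single-cycle ops, invented for illustration) shows why a dependent chain gains nothing from the extra width:

```python
# Toy issue scheduler: count cycles to run a set of single-cycle ops
# with given dependencies, for machine widths 3 and 4.

def run(deps, width):
    # deps[i] = list of instruction indices that instruction i waits on
    finish = {}
    cycle, issued = 0, 0
    while issued < len(deps):
        slots = width
        for i in range(len(deps)):
            if i in finish or slots == 0:
                continue
            if all(j in finish and finish[j] <= cycle for j in deps[i]):
                finish[i] = cycle + 1   # 1-cycle latency per op
                slots -= 1
                issued += 1
        cycle += 1
    return cycle

chain = [[]] + [[i] for i in range(7)]   # 8 ops, each depends on the previous
parallel = [[] for _ in range(8)]        # 8 fully independent ops

for name, prog in (("dependent chain", chain), ("independent ops", parallel)):
    print(name, "- 3-wide:", run(prog, 3), "cycles, 4-wide:", run(prog, 4), "cycles")
```

The dependent chain takes 8 cycles no matter how wide the machine is; the independent ops drop from 3 cycles to 2 when the width goes from 3 to 4.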
Originally posted by: Viditor
One other point to check on, Kuzi, is what I'm asking dmens about (he definitely knows his stuff...).
Basically, it's comparing a 4-wide issue of micro-ops vs. a 3-wide issue of full x86 instructions...
This is over my head, but hopefully dmens can shed some more light.
I do have a link with hints in it...
Link
Originally posted by: Viditor
Welcome dmens! Haven't seen you in a while...
Could you comment on this quote I was sent?
"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generate up to 4 micro ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer number, K8's 3-way x86 decoding is even better than C2D's 4-way micro op generation, but in average they should perform about the same"
Cheers
Originally posted by: dmens
Originally posted by: Viditor
Welcome dmens! Haven't seen you in a while...
Could you comment on this quote I was sent?
"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generate up to 4 micro ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer number, K8's 3-way x86 decoding is even better than C2D's 4-way micro op generation, but in average they should perform about the same"
Cheers
yes i am referring to different metrics. c2d's "4-issue" can be interpreted as the width of the out-of-order engine and portions of the front end. the decoder arrangement refers to the throughput of the x86 decoders.
afaik k10 is still 3 micro-ops wide after the decoder, so even if the decoder can process more bytes of code than a c2d, it still has a lower micro-op throughput at that particular point in the machine. but if k10 requires fewer uops than c2d to do the same work, then the difference is leveled out somewhat. it also depends on the workload. the behavior is quite different, so i don't want to compare.
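A back-of-the-envelope way to combine the two limits dmens describes: front-end throughput is capped both by fetch bytes and by decode/issue width. Assuming an average x86 instruction length of about 4 bytes (an assumption; real code varies roughly 3-5 bytes) and, as a simplification, one micro-op per instruction:

```python
# Rough front-end bound: min(fetch-byte limit, decode/issue-width limit),
# in x86 instructions per cycle. Treats 1 instruction ~ 1 micro-op, which
# is a simplification (fusion and microcoded ops change the ratio).
AVG_X86_INSN_BYTES = 4.0   # assumed average instruction length

def frontend_bound(fetch_bytes_per_cycle, width):
    return min(fetch_bytes_per_cycle / AVG_X86_INSN_BYTES, width)

# figures quoted in this thread: C2D reads 20 bytes/cycle for up to 4
# micro-ops; K8 fetches 32 bytes/cycle and decodes up to 3 x86 instructions
print("C2D:", frontend_bound(20, 4))   # 4.0 -> width-limited, not byte-limited
print("K8: ", frontend_bound(32, 3))   # 3.0 -> width-limited, not byte-limited
```

With those numbers, both designs are limited by their width rather than by fetch bytes, which is consistent with the quote's conclusion that they come out roughly even in practice.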
"The "4-issue" of C2D is different from the "3 complex decodes" of K8.
The former reads 20 bytes per cycle and generate up to 4 micro ops.
The latter fetches 32 bytes per cycle and decodes up to 3 x86 instructions.
In sheer number, K8's 3-way x86 decoding is even better than C2D's 4-way micro op generation, but in average they should perform about the same"
Originally posted by: dmens
so the amd procs track the macro-op, cool, that is an interesting difference to note. but it isn't really accurate to say those machines are 6/12 uops wide, because the frontend-to-rename portion of the pipeline is still 3 uops wide, so even if retirement is capable of taking care of more, the bottleneck is still up front.
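dmens's point reduces to: sustained width is the minimum across pipeline stages once everything is expressed in the same unit. A one-liner with the numbers from this thread (the retirement figure assumes up to 2 uops per macro-op, which is an assumption):

```python
# Sustained throughput = narrowest stage, all in micro-ops per cycle.
stages = {
    "fetch/decode/rename": 3,  # 3 uops/cycle up front, per dmens above
    "retire": 6,               # 3 macro-ops/cycle x assumed 2 uops each
}
print("sustained uops/cycle:", min(stages.values()))  # 3 -> bottleneck up front
```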
I think that Phenom may surprise many people. It should accelerate in performance as memory speeds and clock speeds are increased. Someone pointed out that K8 didn't... well, boys and girls, K10 is most surely NOT K8
Originally posted by: IntelUser2000
I think that Phenom may surprise many people. It should accelerate in performance as memory speeds and clock speeds are increased. Someone pointed out that K8 didn't... well, boys and girls, K10 is most surely NOT K8
Take a look at Anandtech's QX6850 article and tell me whether Barcelona indeed does scale better.
The problem with Anand's tests that screws with everybody's minds is that the Barcelona tests used 1024x768 to show performance that's "CPU bound", while they did not do that with the Core 2 Duo tests. The Core 2 Duo tests used 1600x1200. Tom's Hardware showed the pure CPU power Core 2 Duo had over the Athlons by using 1024x768 resolution and a GeForce 8800GTX. It doesn't happen at 1600x1200.
You can see in: http://www.anandtech.com/cpuch...owdoc.aspx?i=3038&p=15
that Core 2 Duos over 2.66GHz are too GPU bound to really show performance differences. You can see excellent scaling by comparing the 2.33GHz part to the 2.66GHz one.
Oblivion: 83.2%
HL2: 67.3%
Looking at those two points alone, it's better than Barcelona's scaling from 2.0 to 2.5GHz:
Oblivion: 66.4%
HL2: 65.6%
In fact, Core 2 Duo scales better over the limited test range, and it even uses a higher resolution, unlike Barcelona's "CPU bound" resolution. In WME the scaling is even superlinear by the data, at 105%. iTunes is at 95.2%.
By comparison, Barcelona is at 93.6% and 58%, respectively.
So can we conclude anything about scaling from Anand's Barcelona benchmarks? Absolutely not. Until the final hardware arrives, Barcelona only scales better in multi-threaded benchmarks.
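For anyone who wants to check those percentages, "scaling" here appears to be the performance gain divided by the clock gain. A small helper, with a hypothetical fps pair (the real numbers are in the linked article):

```python
# Clock-scaling efficiency: how much of a clock increase shows up as
# performance. 100% = perfectly CPU-bound; lower = other bottlenecks.
def scaling_efficiency(perf_lo, perf_hi, clk_lo, clk_hi):
    perf_gain = perf_hi / perf_lo - 1.0
    clk_gain = clk_hi / clk_lo - 1.0
    return 100.0 * perf_gain / clk_gain

# hypothetical: 100 fps at 2.33GHz rising to 111.8 fps at 2.66GHz
print(f"{scaling_efficiency(100.0, 111.8, 2.33, 2.66):.1f}%")
# ~83%, in the same ballpark as the Oblivion figure quoted above
```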
Originally posted by: Viditor
As to performance vs. # of cores (which will especially affect Tigerton), according to previous benches from AT, even the K8 scales better (though this was on older Clovertowns).
Originally posted by: dmens
Originally posted by: Viditor
As to performance vs. # of cores (which will especially affect Tigerton), according to previous benches from AT, even the K8 scales better (though this was on older Clovertowns).
perhaps tigerton will be less susceptible to diminishing gains because the caneland platform is better equipped for memory bandwidth than bensley or whatever it's called.
brute force that mofo.
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.
Originally posted by: zsdersw
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.
Then why do most applications not show "big time" gains from additional cache?
Originally posted by: jones377
Originally posted by: zsdersw
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.
Then why do most applications not show "big time" gains from additional cache?
Because you are trying to define the term "big time"?
Originally posted by: Viditor
You should read the post before replying...
From the OP:
I've seen Intel Core2 processors with 8MB L2 cache perform about 10-20% faster than the same CPU but with only 4MB L2 cache, and that is clock-for-clock
Originally posted by: keysplayr2003
That would be a Core2Duo (4MB) compared to a Core2Quad (8MB). Not exactly a good direct comparison.
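One way to see when extra L2 matters: time a pointer chase over growing working sets; per-access cost jumps once the set spills out of each cache level. A rough sketch (Python timing is noisy and the exact numbers depend entirely on the machine, so treat it as illustrative only):

```python
import random
import time

def random_cycle(n):
    # Sattolo's algorithm: a random single-cycle permutation, so the
    # chase visits every slot before repeating (no short cycles).
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)            # j < i forces one big cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def ns_per_access(n_elems, steps=1_000_000):
    perm = random_cycle(n_elems)
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = perm[i]                        # dependent load chain
    return (time.perf_counter() - t0) / steps * 1e9

for kb in (256, 2048, 4096, 8192, 32768):
    # ~8 bytes per list slot; size the element count to the target set
    print(f"{kb:>6} KB working set: {ns_per_access(kb * 1024 // 8):.1f} ns/access")
```

Roughly speaking, only workloads whose hot set falls between the two cache sizes (here, between 4MB and 8MB) should show gains like the 10-20% the OP describes.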