AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

TheELF · Oct 30, 2016

Abwx said:
Besides, Clark stipulated that in ST the thread would benefit from all resources, because it's obvious that in SMT the first thread loses throughput but the sum is superior to a single thread's throughput. That's not documented, but Intel's HT doesn't work by providing 30% higher throughput on top of, say, 100% throughput of a single thread; the reality is that the second thread eats into the first thread's throughput, and with HT the split is rather 90 + 40 and not 100 + 30.

Lol, you can't get more than 100% usage out of one core. The only way to gain from SMT is if a single thread can't use all available instructions of the core. If your single software thread loses, say, 20% of IPC due to cache stalls or branch misses or whatever, then the SMT thread can use (up to) this 20% to run. The whole is never above 100% of what the core can do; you can only go over what the software thread is able to use.
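
The model argued in this post can be written down as a small sketch. It assumes the core's capacity is normalized to 1.0 and that the second thread can only fill slots the first thread leaves idle; the function name and the example fractions are illustrative assumptions, not measurements.

```python
def smt_core_throughput(first_thread_use: float, second_thread_demand: float) -> float:
    """Total core utilization (0..1) under a simple slot-filling model.

    first_thread_use: fraction of the core the first thread actually fills.
    second_thread_demand: fraction the second thread could fill on its own.
    """
    for value in (first_thread_use, second_thread_demand):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be within [0, 1]")
    idle = 1.0 - first_thread_use
    # The SMT sibling can only use what the first thread left idle,
    # so the total never exceeds the capacity of the core.
    return first_thread_use + min(idle, second_thread_demand)


# A thread that loses 20% to stalls leaves at most 20% for the SMT sibling.
print(smt_core_throughput(0.8, 0.5))
# A hypothetical "perfect" thread leaves nothing, so SMT adds nothing.
print(smt_core_throughput(1.0, 0.5))
```

Note that this sketch keeps the first thread's share fixed, which is exactly the point disputed later in the thread.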

lolfail9001 · Oct 30, 2016

Abwx said:
If your "estimation" was right, it would mean that AMD's SMT brings an 85-90% gain in Blender, and generally 19% more gain than Intel's HT. As you can see, predictions based on wishful thinking end up being fairy tales.

Someone, quickly, fire up Blender on POWER8.

Dresdenboy · Oct 30, 2016

TheELF said:
It's not 40% faster, it's 40% more IPC. That's throughput, not speed; it will only be 40% faster if you actually find software that is able to use all 10 instructions the ZEN core has available per cycle.

Maybe I didn't understand your point, but this is a simplistic form of my view here:
For ST (@iso freq):
T'put = #inst./cycle
Speed= #inst./cycle

And who said that in real ST code XV constantly maxes out an IPC of 7.0 (2ALU+2AGU+3FP)? Why would SW need to reach 10 ops/clock if in reality the value is more like 1.0 to 3.0, sometimes higher?

Dresdenboy · Oct 30, 2016

Abwx said:
Besides, Clark stipulated that in ST the thread would benefit from all resources, because it's obvious that in SMT the first thread loses throughput but the sum is superior to a single thread's throughput. That's not documented, but Intel's HT doesn't work by providing 30% higher throughput on top of, say, 100% throughput of a single thread; the reality is that the second thread eats into the first thread's throughput, and with HT the split is rather 90 + 40 and not 100 + 30.

Especially in Blender those threads are roughly doing the same kind of work. Without any prioritization the threads would be more like 65+65 in your scenario then.

Let's talk responsiveness.

Abwx · Oct 30, 2016

TheELF said:
Lol, you can't get more than 100% usage out of one core. The only way to gain from SMT is if a single thread can't use all available instructions of the core. If your single software thread loses, say, 20% of IPC due to cache stalls or branch misses or whatever, then the SMT thread can use (up to) this 20% to run. The whole is never above 100% of what the core can do; you can only go over what the software thread is able to use.

Who said that it's 100% usage that I was talking about?

If a thread is at, say, 100 and SMT brings another 30, then it's obvious that in ST the core wasn't at 100%. I'm talking about % relative to the first thread's output: if the first thread is at 100% alone, it will be at 90% when a second thread is processed; there's no way the second thread won't reduce the first thread's throughput.

TheELF said:
40% is the gain on a single thread if the single thread can use all 10 instructions; it's the same thing I said.
This is the throughput of one single thread running on one single core.

Lol, that's complete nonsense. If that were the case, there wouldn't be a single instance where they could say that it has 40% higher IPC, and even if they did as you're stating, then SMT would bring 0% gain; hey, it's already fully maxed out to get those 40% according to you, lol.

He said 40% with the first thread and SMT will come on top.

Abwx · Oct 30, 2016

Dresdenboy said:
Without any prioritization the threads would be more like 65+65 in your scenario then.

Let's talk responsiveness.

Indeed, but prioritization can't be done on a cycle-by-cycle basis, so there will be instances where the second thread causes the first thread to lose a few cycles here and there. As said, if the first thread has a throughput of 100 and SMT brings 30, then with two threads the respective throughputs will be something like 90 + 40. This can be seen in some games where Intel HT reduces performance.
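
The 90 + 40 versus 100 + 30 versus 65 + 65 splits discussed here are simple bookkeeping on the same total. A minimal sketch, assuming throughput is expressed relative to a lone thread at 100; the function and parameter names are made up for illustration.

```python
def smt_split(single_thread, smt_gain, first_thread_loss):
    """Return (first, second) thread throughput when two threads share a core.

    single_thread: throughput of one thread running alone (e.g. 100).
    smt_gain: extra total throughput with SMT enabled (e.g. 30).
    first_thread_loss: throughput the first thread gives up to the second.
    """
    total = single_thread + smt_gain
    first = single_thread - first_thread_loss
    second = total - first
    return first, second


print(smt_split(100, 30, 0))   # (100, 30): first thread untouched
print(smt_split(100, 30, 10))  # (90, 40): first thread loses a little
print(smt_split(100, 30, 35))  # (65, 65): even split, no prioritization
```

In all three cases the total is 130; only the distribution between the two threads changes.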

TheELF · Oct 30, 2016

Dresdenboy said:
Maybe I didn't understand your point, but this is a simplistic form of my view here:
For ST (@iso freq):
T'put = #inst./cycle
Speed= #inst./cycle

And who said that in real ST code XV constantly maxes out an IPC of 7.0 (2ALU+2AGU+3FP)? Why would SW need to reach 10 ops/clock if in reality the value is more like 1.0 to 3.0, sometimes higher?

You are talking about how code runs. Where did anybody from AMD say that (all) code will run with 40% more IPC? All they said was that the core will have 40% more IPC.
And that's all I'm saying too: if a thread can (could) use 10 IPC per core, then it would gain 40% in execution speed.
Don't forget that Intel raised IPC for many gens without adding execution units. That's software IPC (or CPI, how many cycles it takes for an instruction to finish): how much of the available IPC software can use. AMD only talked about hardware IPC, how many instructions are available on the core (literally, they said 40% more IPC per core).

TheELF · Oct 30, 2016

Abwx said:
Who said that it's 100% usage that I was talking about?

If a thread is at, say, 100 and SMT brings another 30, then it's obvious that in ST the core wasn't at 100%. I'm talking about % relative to the first thread's output: if the first thread is at 100% alone, it will be at 90% when a second thread is processed; there's no way the second thread won't reduce the first thread's throughput.

Lol, that's complete nonsense. If that were the case, there wouldn't be a single instance where they could say that it has 40% higher IPC, and even if they did as you're stating, then SMT would bring 0% gain; hey, it's already fully maxed out to get those 40% according to you, lol.

He said 40% with the first thread and SMT will come on top.

SMT would bring 0% gain if there were any perfect software that would always be able to use all available instructions.
Since there is no such software you often have some improvement.
If you have one thread running with real-time priority on a core, then it should not lose any speed if you run a second thread through SMT with low priority.
But still it does not matter: the maximum you can get out of a core is the maximum number of instructions it has available; it doesn't matter how you split it up.

Abwx · Oct 30, 2016

TheELF said:
You are talking about how code runs. Where did anybody from AMD say that (all) code will run with 40% more IPC? All they said was that the core will have 40% more IPC.

And how do you think they measured this IPC increase?

Without using applications?

TheELF said:
And that's all I'm saying too: if a thread can (could) use 10 IPC per core, then it would gain 40% in execution speed.

You are wrong, one more time. If what you said was right, it would mean that the max theoretical throughput has increased by 40%, which would mean that it would have been well below BDW in the Blender demo they displayed.

FTR, Zen has more than twice the throughput of an XV core in this demo; surely that is compatible with your theories extracted from about nowhere. Indeed, they couldn't even state 40% better IPC in the case you're stating, because what matters is the IPC improvement in real apps, and they know perfectly well that this will be checked once the chip launches.

TheELF · Oct 30, 2016

Abwx said:
And how do you think they measured this IPC increase?

Without using applications?

There is nothing to be measured with applications: construction cores have 4 integer + 2 FPU instructions per core = 6 IPC; Zen will have 10 IPC.

Abwx said:
You are wrong, one more time. If what you said was right, it would mean that the max theoretical throughput has increased by 40%, which would mean that it would have been well below BDW in the Blender demo they displayed.

It is well below Broadwell since it only has 8 instructions per core while ZEN has 10.

Abwx said:
Indeed, they couldn't even state 40% better IPC in the case you're stating, because what matters is the IPC improvement in real apps, and they know perfectly well that this will be checked once the chip launches.

Yup, just like with the FX before it, it will only be checked with distributed computing software or GPU benchmarks, both of which are prime candidates for being able to use all available throughput, making it look fast.

Dresdenboy · Oct 30, 2016

Abwx said:
Indeed, but prioritization can't be done on a cycle-by-cycle basis, so there will be instances where the second thread causes the first thread to lose a few cycles here and there. As said, if the first thread has a throughput of 100 and SMT brings 30, then with two threads the respective throughputs will be something like 90 + 40. This can be seen in some games where Intel HT reduces performance.

It remains to be seen whether OS thread priorities will be involved here, in addition to live performance metric evaluation to adjust thread execution.

TheELF said:
You are talking about how code runs. Where did anybody from AMD say that (all) code will run with 40% more IPC? All they said was that the core will have 40% more IPC.

Of course I am. I don't know if there is much use in such a metric and in redundantly calling it "IPC" when there are simple metrics like "issue width", "decoding rate", etc.

TheELF said:
And that's all I'm saying too: if a thread can (could) use 10 IPC per core, then it would gain 40% in execution speed.
Don't forget that Intel raised IPC for many gens without adding execution units. That's software IPC (or CPI, how many cycles it takes for an instruction to finish): how much of the available IPC software can use. AMD only talked about hardware IPC, how many instructions are available on the core (literally, they said 40% more IPC per core).

Please check your definition of IPC. And I'm interested in where you are seeing something in Zen being 1.4x that of XV.

TheELF said:
SMT would bring 0% gain if there were any perfect software that would always be able to use all available instructions.
Since there is no such software you often have some improvement.
If you have one thread running with real-time priority on a core, then it should not lose any speed if you run a second thread through SMT with low priority.
But still it does not matter: the maximum you can get out of a core is the maximum number of instructions it has available; it doesn't matter how you split it up.

Exactly, there is no such SW.

Abwx · Oct 30, 2016

TheELF said:
Yup, just like with the FX before it, it will only be checked with distributed computing software or GPU benchmarks, both of which are prime candidates for being able to use all available throughput, making it look fast.

Before it wasn't BD but Steamroller and then Excavator; both were announced by AMD with a given IPC improvement, and everyone here knows that it was related to applications, not theoretical throughput. Indeed, this shows in benchmarks, so you are just badly reinterpreting the notion of IPC according to your needs, certainly not according to the common understanding of the term, as in the quote below:

TheELF said:
It is well below Broadwell since it only has 8 instructions per core while ZEN has 10.

Lol, it was above in the bench, and that's all that matters to prove that it had better throughput, at least in this demo and surely also in apps using legacy code. Anyway, I see that now you are redefining the meaning of the word "below", even if that means discarding the real numbers and using theoretical figures that have no relevance to the debate...

Abwx · Oct 30, 2016

Dresdenboy said:
Please check your definition of IPC. And I'm interested in where you are seeing something in Zen being 1.4x that of XV.

That 1.4x is the result of random logic; you shouldn't even waste your time asking for explanations since there are none.

Notice that we are told that Cat cores are 6 IPC while Zen is 10 IPC and hence only 66% better at best, but of course BDW being only 8 IPC doesn't mean that it's only 30% better than the Cat cores. It's all a question of perspective and double standards, since we are also told that BDW's 8 IPC is above Zen's 10 IPC...
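
The percentages in this argument follow directly from the per-core figures claimed earlier in the thread (6, 8 and 10). A small sketch of that arithmetic, treating those figures purely as the claimed peak widths under dispute, not as measured IPC; the helper name is hypothetical.

```python
def percent_gain(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (new / old - 1.0) * 100.0


# Peak-width figures as claimed in the thread, not measured IPC.
claimed = {"6-wide core": 6, "8-wide core (BDW)": 8, "10-wide core (Zen)": 10}

print(f"10 vs 6: {percent_gain(10, 6):.1f}%")
print(f" 8 vs 6: {percent_gain(8, 6):.1f}%")
print(f"10 vs 8: {percent_gain(10, 8):.1f}%")
```

On these figures the 10-wide design would be about two thirds wider than the 6-wide one and a quarter wider than the 8-wide one, which is why ranking the 8-wide core above the 10-wide one on the same metric is called a double standard here.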

TheELF said:
There is nothing to be measured with applications: construction cores have 4 integer + 2 FPU instructions per core = 6 IPC; Zen will have 10 IPC.

It is well below Broadwell since it only has 8 instructions per core while ZEN has 10.

If that's not trolling, then what is it?

blublub · Oct 30, 2016

Lots of real technical talk here, way out of my league.

My simple notion:

Given AMD's facepalm with BD, I highly doubt they were showing off ZEN in Blender as the best-of-the-best scenario.

Management has done better in the last 6 months than in all the years before, so I really don't see them messing this up so stupidly.

Dresdenboy · Oct 30, 2016

TheELF said:
There is nothing to be measured with applications: construction cores have 4 integer + 2 FPU instructions per core = 6 IPC; Zen will have 10 IPC.

It is well below Broadwell since it only has 8 instructions per core while ZEN has 10.

IPC usually means measured instructions per clock over several cycles (e.g. 1M, or all cycles of an application run), not "issue width", which is kind of a peak value, as the decoders and µOp$ simply can't provide enough instructions per cycle.
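
To make the distinction concrete, here is a minimal sketch of IPC as a measured average versus issue width as a fixed peak. The counter values are hypothetical placeholders, not readings from any real CPU, and the function names are made up for illustration.

```python
def measured_ipc(instructions_retired: int, cycles: int) -> float:
    """Average instructions per clock over a measurement window."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def fraction_of_peak(ipc: float, issue_width: int) -> float:
    """How much of the theoretical peak (issue width) the run achieved."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return ipc / issue_width


# Hypothetical counter readings for a 1M-cycle window (not real measurements).
ipc = measured_ipc(instructions_retired=1_500_000, cycles=1_000_000)
print(ipc)                        # 1.5
print(fraction_of_peak(ipc, 10))  # 0.15
```

A quoted "IPC improvement" in this sense compares two such measured averages on the same workload; it says nothing directly about how wide the core is.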

Abwx said:
That 1.4x is the result of random logic; you shouldn't even waste your time asking for explanations since there are none.

Notice that we are told that Cat cores are 6 IPC while Zen is 10 IPC and hence only 66% better at best, but of course BDW being only 8 IPC doesn't mean that it's only 30% better than the Cat cores. It's all a question of perspective and double standards, since we are also told that BDW's 8 IPC is above Zen's 10 IPC...

In a thread that is being read by less tech-savvy people, I think it is important not to simply leave wrong statements as they are.

BTW, here is Jaguar's (and Bobcat's) IPC for applications (whatever they are):

Abwx said:
If that's not trolling, then what is it?

Sometimes even smart people might run into kind of a dead end, thought-wise. Then it might help to take them out of their seat, shake them a bit, and put them back in place. Or alternatively, one might kindly ask them to step back and check their POV.

superstition · Oct 30, 2016

Dresdenboy said:
Sometimes even smart people

No one knows everything, certainly.

bjt2 · Oct 30, 2016

Dresdenboy said:
IPC usually means measured instructions per clock over several cycles (e.g. 1M, or all cycles of an application run), not "issue width", which is kind of a peak value, as the decoders and µOp$ simply can't provide enough instructions per cycle.

In a thread that is being read by less tech-savvy people, I think it is important not to simply leave wrong statements as they are.

BTW, here is Jaguar's (and Bobcat's) IPC for applications (whatever they are):

Sometimes even smart people might run into kind of a dead end, thought-wise. Then it might help to take them out of their seat, shake them a bit, and put them back in place. Or alternatively, one might kindly ask them to step back and check their POV.

Are the last two columns the mean percentage of gates not clocked during the time period? Are these numbers applicable on Zen (maybe even better)? If yes, these numbers are quite interesting. They say that even a power virus leaves most of the transistors off... So the clock gating has reached a very interesting efficiency. And since the leakage on 14nm FF LPP is VERY low compared to 28nm BULK, this can mean very low power consumption for Zen core, since clock gated transistors only draw leakage power...

Arachnotronic · Oct 30, 2016

I took an XV @ 4.2GHz (A12-9800), multiplied all Geekbench 4 sub-scores by 1.4, and compared the subscores to a Broadwell-E 6950X @ 4.2GHz (stock L3$). Here's what I got.
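
The scaling step described in this post can be sketched as below. The sub-score values are placeholders invented purely to make the snippet runnable; they are NOT real Geekbench 4 results, and the function names are hypothetical.

```python
def scale_scores(scores: dict[str, float], factor: float) -> dict[str, float]:
    """Multiply every sub-score by the same factor (e.g. 1.4 for +40%)."""
    return {name: score * factor for name, score in scores.items()}


def ratio_to_reference(projected: dict[str, float],
                       reference: dict[str, float]) -> dict[str, float]:
    """Projected score divided by the reference score, per shared subtest."""
    return {name: projected[name] / reference[name]
            for name in projected if name in reference}


# Placeholder numbers only; these are NOT real Geekbench 4 results.
xv_scores = {"SGEMM": 1000.0, "SFFT": 2000.0}
bdw_scores = {"SGEMM": 2000.0, "SFFT": 2500.0}

projected = scale_scores(xv_scores, 1.4)
for name, ratio in ratio_to_reference(projected, bdw_scores).items():
    print(f"{name}: {ratio:.2f}x of reference")
```

Applying one uniform factor assumes every subtest benefits equally, including memory-bound ones, which is the objection raised in the replies.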

AtenRa · Oct 30, 2016

I would suggest you use the Core i3 6100 instead; the A12-9800 is an APU with only L2 cache at a 65W TDP, whereas the BDW-E 6950X is a 140W TDP SKU with 25MB of L3.

Abwx · Oct 30, 2016

Arachnotronic said:
I took an XV @ 4.2GHz (A12-9800), multiplied all Geekbench 4 sub-scores by 1.4, and compared the subscores to a Broadwell-E 6950X @ 4.2GHz (stock L3$). Here's what I got.

Don't think that memory latency will increase by 40% as well.

So what is your conclusion? Should we stick with this 40% number, given that the FP scores show it's unlikely Zen could have been on par with BDW in Blender if FP-related IPC was increased by exactly 40%?

AtenRa said:
I would suggest you use the Core i3 6100 instead; the A12-9800 is an APU with only L2 cache at a 65W TDP, whereas the BDW-E 6950X is a 140W TDP SKU with 25MB of L3.

You have a point here, but it's likely that the FP scores, contrary to the INT scores, are not much influenced by the cache, at least in ST.

dark zero · Oct 30, 2016

Arachnotronic said:
I took an XV @ 4.2GHz (A12-9800), multiplied all Geekbench 4 sub-scores by 1.4, and compared the subscores to a Broadwell-E 6950X @ 4.2GHz (stock L3$). Here's what I got.

Take something into account... Excavator lacks some instructions that Zen is supposed to have. That would change the results in some aspects.
Otherwise, I was expecting Haswell levels in ST and Broadwell levels in MT... it seems that both will be at Haswell levels. It's not awful, but not godly.

If well priced, they would get some fanbase.

Arachnotronic · Oct 30, 2016

AtenRa said:
I would suggest you use the Core i3 6100 instead; the A12-9800 is an APU with only L2 cache at a 65W TDP, whereas the BDW-E 6950X is a 140W TDP SKU with 25MB of L3.

I don't have a Core i3 6100 available to me at the moment, sorry. Anyway, the i3 6100 is Skylake based and the L3$ is actually a lot faster on SKL than it is on BDW-E (2.8GHz only). GB4 likes L3$ speed more than it likes L3$ size, AFAICT.

Dresdenboy · Oct 30, 2016

bjt2 said:
Are the last two columns the mean percentage of gates not clocked during the time period? Are these numbers applicable on Zen (maybe even better)? If yes, these numbers are quite interesting. They say that even a power virus leaves most of the transistors off... So the clock gating has reached a very interesting efficiency. And since the leakage on 14nm FF LPP is VERY low compared to 28nm BULK, this can mean very low power consumption for Zen core, since clock gated transistors only draw leakage power...

I'd expected some ratio: % of max possible, not of all.
Here is an explanation:
http://www.eetimes.com/document.asp?doc_id=1276117

Zen likely has roughly similar values.

Abwx · Oct 30, 2016

Since we are talking about GB4, I'd like to clarify a few things, if I may.

First, to extrapolate this bench's numbers and make the comparison with AMD's Blender demo, one should discard GB's FP scores that are not related to what Blender actually does. E.g., the SGEMM and SFFT subtests are relevant because they are based on single-precision computation like Blender; the speech recognition subtest is also single precision, I think.

The N-Body subtest, on the other hand, is surely using double precision, since this kind of problem uses equations that diverge vastly with initial conditions, to the point that even double precision is not enough if the time windows are too long, which would imply using x87 at some point...

While we are on renderers, I will also point out that Cinebench 11.5 and R15 are two completely different benches that measure completely different things: the former uses single precision while the latter uses double precision. As such, it was a mistake for all sites to completely discard 11.5 in favor of R15.

Last but not least, I think that Blublub is right when he states that AMD didn't show Zen in the best possible light, and that's surely the reason why they used Blender, as it discloses only the total single-precision throughput, with an SMT gain such that it's impossible to guess the contributions of the first and second threads.

Had they used CB 11.5, the SMT gain would be much easier to estimate, while CB R15 would have disclosed both double-precision throughput and SMT gain; same with POV-Ray (which behaves opposite to Blender in that it manages to max out a core with a single thread), which uses double precision.

HiroThreading · Oct 31, 2016

EDIT: Deleted.

Apologies, I was being off topic.
