AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

cdimauro · Nov 19, 2016

bjt2 said:
Mean IPC in consumer code is around 1/cycle or lower. HPC FPU code can reach 2.5 (e.g. Spec FP). Almost only power viruses can go over 3.
AMD can do 4 INT PLUS 4 FP PLUS 2 MEM.

You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

Intel processors can dispatch 8 of them, and the scheduler works in the best condition (it has the complete "vision" and can make the best decision).

Last but not really least, the decoder can decoder up to 4 instructions, and this is one of the most important bottlenecks for ST-code.

INTEL can do 4 among FP and INT (and FP are limited to 2 true FP + 2 vecint) and 3-4 MEM.

Not counting the 256-bit AVX instructions.

So in some, if not most, cases AMD's SMT can even beat INTEL's...

Well, if Blender was the best case, as I think, there'll be only a few cases where it can happen, and by a very small margin.

Which isn't a good result.

Dresdenboy · Nov 19, 2016

TheELF said:
It's not 40% faster,it's 40% more IPC,that's throughput not speed,it will only be 40% faster if you actually find a software that will be able to use all 10 instructions the ZEN core has available per cycle.
Which will be pretty difficult since there aren't many CPUs out there (if there are any) with 10 instructions per core,I guess that's why they went with blender instead of some "traditional" benchmark.

I think we were past this point already. But 40% could also be reached by comparing Zen with a CPU, which only achieves a 29% lower t'put, as you say.

This assumption of the need to have a constant 100% utilization also leads to XV already achieving ~100% on 7 FUs for 1T. This also would mean that 40% can't be average, as it is already the max. value.

Thus you are implicitly stating that XV should have 2-7x IPC of SKL based on SKL real world measurements.

We had this before. Please verify your assumptions.

Arachnotronic · Nov 19, 2016

majord said:
did you look at the outliers which made up the score?

the most significant of which (from the first A12 set):

HTML5 DOM:

Bristol ridge :932
Haswell-E : 3908

= 420% IPC advantage

Then there's the inclusion of Memory performance in the test

I'm not saying all workloads should show a consistent performance delta between architectures, they never do, but you do have to be able to recognise an outlier that's so extreme it's capable of skewing an IPC comparison by 20-30%

This, and the inclusion of things like AES (Which work back in Excavators favor) are reasons why Geekbench is not very useful for architecture comparisons

GB4 score is calculated using geometric mean.

https://en.m.wikipedia.org/wiki/Geometric_mean

Arachnotronic · Nov 19, 2016

jpiniero said:
I thought we had this discussion already... it's likely due to the 9800's lack of L3. The 8350 for instance gets around 3400 on the HTML 5 DOM test. If anything, it's great because it highlights weaknesses of a chip that you might not obviously see with just one test.

We did have this discussion before and I shot down your thesis by pointing out that th A9X performs as well per clock as the A9 in this test. A9X has no l3$, while A9 does.

jpiniero · Nov 19, 2016

Arachnotronic said:
We did have this discussion before and I shot down your thesis by pointing out that th A9X performs as well per clock as the A9 in this test. A9X has no l3$, while A9 does.

It's not the L3, it's the lack of L3 on Bristol Ridge. Those are two separate things. You'd have to come up with a good reason that Vishera is so much faster to think otherwise.

TheELF · Nov 19, 2016

Dresdenboy said:
I think we were past this point already. But 40% could also be reached by comparing Zen with a CPU, which only achieves a 29% lower t'put, as you say.

This assumption of the need to have a constant 100% utilization also leads to XV already achieving ~100% on 7 FUs for 1T. This also would mean that 40% can't be average, as it is already the max. value.

Thus you are implicitly stating that XV should have 2-7x IPC of SKL based on SKL real world measurements.

We had this before. Please verify your assumptions.

Did they state that it's 40% avg?
Yes 40 % could be throughput only it could be speed only it could be a mix of both or it could be whatever.
When people here state same IPC as... what are they refering to?Speed or throughput or avg in a number of cases?
What do you call the speed if you have one thread that only needs one command per cycle?That's where intel is so much better,it has a lot of things that improve cycles per instruction.

Too many unknowns...

Arachnotronic · Nov 19, 2016

TheELF said:
Did they state that it's 40% avg?
Yes 40 % could be throughput only it could be speed only it could be a mix of both or it could be whatever.
When people here state same IPC as... what are they refering to?Speed or throughput or avg in a number of cases?
What do you call the speed if you have one thread that only needs one command per cycle?That's where intel is so much better,it has a lot of things that improve cycles per instruction.

Too many unknowns...

IPC is commonly agreed on to mean "single threaded performance per clock."

TheELF · Nov 19, 2016

Arachnotronic said:
IPC is commonly agreed on to mean "single threaded performance per clock."

It was also commonly agreed upon what core meant...

bjt2 · Nov 19, 2016

cdimauro said:
You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

6 uop cycle from the ucache is enough for two heavy duty 3 IPC threads... And 10 units are eough for the peak and keep 6 uops/cycle steady execution flux (on optimized code and without cache misses, obviously)

cdimauro said:
Intel processors can dispatch 8 of them, and the scheduler works in the best condition (it has the complete "vision" and can make the best decision).

http://www.agner.org/optimize/blog/read.php?i=415 According to agner, 4 fused uops, without fusion limits, so it is correct that the limit is 8 uops, but rarely there are all this fused ops...

cdimauro said:
Last but not really least, the decoder can decoder up to 4 instructions, and this is one of the most important bottlenecks for ST-code.

Yes but we know how dense are the AMD MOPs, up to 3 uops (or an RMW intruction), and from the uop cache, if the hit rate is enough, we optain 6 uops cycle

cdimauro said:
Not counting the 256-bit AVX instructions.

In 256 bit is correct that you have the same (fmul or fadd) or double (fmac) throughput of Zen, because it can do 2x256 bit FMAC and 2x256 VECINT, but in that case no int operations because all 4 ports are occupied, while zen can do 4 int ops in parallel. This is what i said. Without specifying bits. But it can do 1 FMUL+1 FADD or 1 FMAC+ 1 FADD or 1 FMAC + 1 FMUL or FMAC and for legacy x87 or 128 bit code Zen throughput is equal or superior: 2 FMUL + 2 FADD or 1 FMAC + 1 FMUL + 1 FADD or 2 FMAC

cdimauro said:
Well, if Blender was the best case, as I think, there'll be only a few cases where it can happen, and by a very small margin.

Which isn't a good result.

I don't think that it's an outlier... Surely it's the best case, but looking at the decoders, uop cache, cache bandwidth, number of pipeline, i don't think its IPC will be so low...

Regarding SMT... At 128 bit certainly the higher ports helps to have higher MT IPC...

cdimauro · Nov 19, 2016

Arachnotronic said:
IPC is commonly agreed on to mean "single threaded performance per clock."

I don't subscribe it, as we already discussed here around 1 month ago.

bjt2 said:
6 uop cycle from the ucache is enough for two heavy duty 3 IPC threads... And 10 units are eough for the peak and keep 6 uops/cycle steady execution flux (on optimized code and without cache misses, obviously)

And according to the Blender results, Broadwell(-E) resources guaranty almost exactly the same result.

http://www.agner.org/optimize/blog/read.php?i=415

Click to expand...

According to agner, 4 fused uops, without fusion limits, so it is correct that the limit is 8 uops, but rarely there are all this fused ops...

Sure, but Intel's scheduler has a better chance to feed its ports, since it has an exact "vision" of the platform in a precise moment.

Yes but we know how dense are the AMD MOPs, up to 3 uops (or an RMW intruction), and from the uop cache, if the hit rate is enough, we optain 6 uops cycle

Intel's uop can also carry a lot of "actions/operations". That's why you see similar results in Blender. And that's why in has shown much better ST performance.

I pretty curious to see how well Zen can perform with heavy ST code, like emulators, compilers, databases, etc..

In 256 bit is correct that you have the same (fmul or fadd) or double (fmac) throughput of Zen, because it can do 2x256 bit FMAC and 2x256 VECINT, but in that case no int operations because all 4 ports are occupied, while zen can do 4 int ops in parallel. This is what i said. Without specifying bits. But it can do 1 FMUL+1 FADD or 1 FMAC+ 1 FADD or 1 FMAC + 1 FMUL or FMAC and for legacy x87 or 128 bit code Zen throughput is equal or superior: 2 FMUL + 2 FADD or 1 FMAC + 1 FMUL + 1 FADD or 2 FMAC

FMAC is emulated by FMAC + FMADD in Zen, as you know. That's why it can't reach Intel performance in this case.
Intel has also two symmetrical FPU units, that can do any FMAC/FMUL/FADD. So it can do 2 256-bit FMACs or FMULs or FADDs, whereas Zen is limited to 1 of those (per type).

It can also do 2 VALU or VShift (on the same port), and there's another one for VALU or VShuffle.

On Zen we don't know how is the situation, but if it follows a design similar to the Jaguar one, you only have 2x128-bit = 1x256 bit VShift or "conversion", and 4x128-bit = 2x256-bit VALU; I don't know about the shuffle operations.

And you are wrong about the integer units: even with all 3 Int/FPU ports busy, there's always one which is free for "integer" (ALU, Shift, Branch) operations.

Plus 2x256-bit load, 1x256 bit store, and one AGU units. Whereas here Zen has only 2 ports (the 2 AGUs) where you can submit memory operations.

I don't think that it's an outlier... Surely it's the best case, but looking at the decoders, uop cache, cache bandwidth, number of pipeline, i don't think its IPC will be so low...

I haven't said that it's low, but I think that it's likely that it'll be lower than Intel counterparts.

IF (and only if) AMD's statement is correct, 40% more IPC compared to XV can give an estimate.

Regarding SMT... At 128 bit certainly the higher ports helps to have higher MT IPC...

We have already seen Blender results...

cytg111 · Nov 19, 2016

TheELF said:
It was also commonly agreed upon what core meant...

you need help climbing down from that tree . Lets agree to disagree.

TheELF · Nov 19, 2016

Well it buggs me what can I say,it gives any company card blanche to use technical terms to misslead it's customers without lying.

Dresdenboy · Nov 19, 2016

TheELF said:
Well it buggs me what can I say,it gives any company card blanche to use technical terms to misslead it's customers without lying.

I have no evidence that they did, in both cases. Do you? Is there a ruling in the case about PD cores, yet?

Problems could arise if there is room for interpretation. Whoever touched areas like science or requirement engineering will know what I mean.

bjt2 · Nov 19, 2016

cdimauro said:
And you are wrong about the integer units: even with all 3 Int/FPU ports busy, there's always one which is free for "integer" (ALU, Shift, Branch) operations.

This because the FP ports are three, of which two can do FP and all three VECINT, if I remember well... I thought that also the fourth port could do vecint ops, my bad... That would not be so bad, because the scheduler could have given high priority to the int ops... But if the fourth is free of FP and VECINT, it's better for the scheduler (simpler) and bad only for VECint performance...

cdimauro said:
IF (and only if) AMD's statement is correct, 40% more IPC compared to XV can give an estimate.

I think that 40% is a minimum value... In this thread was shown that the lack of L3 in XV hurts performance a lot.
The differences in blender are more 70-80% than 40% and now Zen is on par... So the maximum gain is more in the 70-80% ballpark than 40%... Moreover frontend (uop cache and decoders), backend (queues, stack engine, 2x retire rate) and cache (L1 L2 and L3) are improved a lot from XV, so i am not surprised of this huge gain...

coercitiv · Nov 19, 2016

cdimauro said:
I don't subscribe it, as we already discussed here around 1 month ago.

Yes, and when pressed to discuss how that would apply to CPUs with the same micro architecture but different core count, you began repeating the definition of IPC, which states it is application specific - hence dependent on the amount of threads said app can use. Once we go on that strict road we can only mention IPC in relation to a specific app, which for discussions on this forum is a hindrance.

That is why, whether you like or not, people have commonly agreed to equate IPC to "single threaded performance per clock", while using the term "throughput" to indicate multi-threaded scenarios, and only "performance" when clocks are also taken into consideration. It's turned out to be quite a productive arrangement, helping people get their point across, but by all means, go ahead try to change that for your own comfort!

imported_ats · Nov 20, 2016

cdimauro said:
You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

Intel processors can dispatch 8 of them, and the scheduler works in the best condition (it has the complete "vision" and can make the best decision).

Last but not really least, the decoder can decoder up to 4 instructions, and this is one of the most important bottlenecks for ST-code.

No, no it is not. For ST code, branches and loads are the most important bottlenecks. Neither of those really depends on instruction decoding or scheduling. Most code is extremely fortunate if it gets even close to 2 IPC.

cdimauro · Nov 20, 2016

bjt2 said:
This because the FP ports are three, of which two can do FP and all three VECINT, if I remember well... I thought that also the fourth port could do vecint ops, my bad... That would not be so bad, because the scheduler could have given high priority to the int ops... But if the fourth is free of FP and VECINT, it's better for the scheduler (simpler) and bad only for VECint performance...

3 256-bit VecInt ports should enough.

I think that 40% is a minimum value... In this thread was shown that the lack of L3 in XV hurts performance a lot.
The differences in blender are more 70-80% than 40% and now Zen is on par... So the maximum gain is more in the 70-80% ballpark than 40%... Moreover frontend (uop cache and decoders), backend (queues, stack engine, 2x retire rate) and cache (L1 L2 and L3) are improved a lot from XV, so i am not surprised of this huge gain...

It's AMD that stated the +40% IPC over XV, not me. Don't you trust what AMD reported?

coercitiv said:
Yes, and when pressed to discuss how that would apply to CPUs with the same micro architecture but different core count, you began repeating the definition of IPC, which states it is application specific - hence dependent on the amount of threads said app can use. Once we go on that strict road we can only mention IPC in relation to a specific app, which for discussions on this forum is a hindrance.

Guess what: you run/test specific applications to get some IPC number.

That is why, whether you like or not, people have commonly agreed to equate IPC to "single threaded performance per clock", while using the term "throughput" to indicate multi-threaded scenarios, and only "performance" when clocks are also taken into consideration. It's turned out to be quite a productive arrangement, helping people get their point across, but by all means, go ahead try to change that for your own comfort!

Well, nobody stops to take other metrics to measure something else.

If the IPC definition isn't good, you can define a new metric. Everything depends on what you want to measure.

imported_ats said:
No, no it is not. For ST code, branches and loads are the most important bottlenecks. Neither of those really depends on instruction decoding or scheduling. Most code is extremely fortunate if it gets even close to 2 IPC.

I only wanted to stress the fact that the decoder itself puts a limit on the number of instructions that can be executed.

To be more clear: you can have tens of ports/executing units, but in ST code you have this upper limit for the best cases, and so the number of ports that can happily be used.

Which is not the common/normal case. As you reported, having many branches (and chains of dependent instructions; especially bases on loads) hurts the IPC... and can stress a lot the decoders. If you have to frequently change the executed code, having queues for micro-ops can be of little or no help here, and the decoder only guarantees a maximum of 4 of instructions to feed the backend.

That's why I want to see some benchmarks with emulators, etc..

bjt2 · Nov 20, 2016

cdimauro said:
It's AMD that stated the +40% IPC over XV, not me. Don't you trust what AMD reported?

AMD didn't say if the 40% was minimum, mean, maximum, but looking at blender, the maximum is at least +80%. This imply that 40% at least is the mean value, that is reasonable. Stating that 40% is the minimum in an official statement is very dangerous, because a single example of less than 40% gain would suffice to be wrong.
Thinking better on this subject, i think (and probabily can be verified lokking at the small notes) that AMD intended mean IPC, as it is reasonable.

cdimauro said:
To be more clear: you can have tens of ports/executing units, but in ST code you have this upper limit for the best cases, and so the number of ports that can happily be used.

Which is not the common/normal case. As you reported, having many branches (and chains of dependent instructions; especially bases on loads) hurts the IPC... and can stress a lot the decoders. If you have to frequently change the executed code, having queues for micro-ops can be of little or no help here, and the decoder only guarantees a maximum of 4 of instructions to feed the backend.

That's why I want to see some benchmarks with emulators, etc..

In the case of Zen you can safely assume a weighted mean between 4 and 6 upos/cycle, weighted by the uop cache hit/miss ratio. Because if there is a branch misprediction, the uops can still be taken on the uop cache, so if there is an hit, the throughput is still 6 uops/cycle.
You don't need the decoder again, except if the branch is in a never taken piece of code... But we are talking of loops here, right? Because a single path without loops has a negligible execution time, compared to loops of code and so in the total performance evaluation can be safely ignored. We are talking of program that take at least seconds to be executed, so piece of code executed only one time and first execution of a loop, were the uop cache probabily would miss, are a negligile fraction of the total execution time...

FIVR · Nov 20, 2016

Geez, 74 pages and 73 of them are special friends ELF and Cidimauro arguing about what "40%" means, as if that has any bearing on current zen performance whatsoever. Can we give these trolls their own thread to fill with this drivel? It's pedantic and totally unrelated to technology, it's also threadcrapping at this point.

KTE · Nov 20, 2016

I expected 40% to be the minimum actually, and I have seen it quoted somewhere here, but when searching I could not find a direct reference, except:

First point suggests pretty clearly that 40% is with the SMT. Otherwise why list it under how the 40% IPC is gained?

Going off HotChips, I'd say it's IPC pretty clearly, and that means per core. Not 40% performance.

Sent from HTC 10
(Opinions are own)

inf64 · Nov 20, 2016

KTE said:
I expected 40% to be the minimum actually, and I have seen it quoted somewhere here, but when searching I could not find a direct reference, except:

First point suggests pretty clearly that 40% is with the SMT. Otherwise why list it under how the 40% IPC is gained?

Going off HotChips, I'd say it's IPC pretty clearly, and that means per core. Not 40% performance.

Sent from HTC 10
(Opinions are own)

40% is ST IPC and was confirmed by AMD fellow at the HC QnA session:

Question: Did 40% uplift on Zen IPC include boost from SMT? AMD: No, that was just 1-thread improvement. #HotChips

https://mobile.twitter.com/Daniel_Bowers/status/768270633125806081

I think it is clear that 40% is just an average number they got from a mix of ST workloads (mostly int). SMT could bring 20-30% more performance when 2 threads run on one core meaning thread for thread Zen could be even 80%+ better in MT workloads.

KTE · Nov 20, 2016

inf64 said:
40% is ST IPC and was confirmed by AMD fellow at the HC QnA session:
https://mobile.twitter.com/hashtag/HotChips?src=hash
https://mobile.twitter.com/Daniel_Bowers/status/768270633125806081

That's what I was looking for. Thanks!

I stand corrected, but their HC slide is incorrect then.

Sent from HTC 10
(Opinions are own)

majord · Nov 20, 2016

Arachnotronic said:
GB4 score is calculated using geometric mean.

https://en.m.wikipedia.org/wiki/Geometric_mean

Thanks, found the page explaining their scores. Being relative to base-line, that makes sense to use geometric mean.

That does not invalid the point though, it will still skew results significantly.

Anyway I did some testing, and also threw in a sandy bridge i3 result, i3 for two reasons: no turbo to mess with results, and my Skylake system is an i3,

I should say, I have no problem with these tests "skewing" the end result, providing geekbench is seen as one application amongst a wider test suite.. After all this is how "non-synthetic" applications end up with different performance across achitectures - certain code sections may run an order of magnitude slower on a given architecture, resulting in a larger than expected overall performance delta. The problem is when using GB to summarise performance/IPC full-stop.

witeken · Nov 20, 2016

majord said:
Anyway I did some testing, and also threw in a sandy bridge i3 result, i3 for two reasons: no turbo to mess with results, and my Skylake system is an i3,

That difference between Sandy Bridge and Skylake, though.

majord · Nov 20, 2016

Yeah, I was thinking of throwing the SKL SNB comparison column in .. for the record SKL is 1.25 (25%) > SNB with AES removed.

AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Member

Golden Member

Lifer

Lifer

Lifer

Diamond Member

Lifer

Diamond Member

Senior member

Member

Lifer

Diamond Member

Golden Member

Senior member

Diamond Member

Senior member

Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member