AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Status
Not open for further replies.

cdimauro

Member
Sep 14, 2016
163
14
61
Mean IPC in consumer code is around 1 per cycle or lower. HPC FPU code can reach 2.5 (e.g. SPECfp). Almost only power viruses can go over 3.
AMD can do 4 INT PLUS 4 FP PLUS 2 MEM.
You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

Intel processors can dispatch 8 of them, and the scheduler works under the best conditions (it has complete "vision" and can make the best decisions).

Last but not least, the decoder can decode up to 4 instructions, and this is one of the most important bottlenecks for ST code.
Intel can do 4 among FP and INT (with FP limited to 2 true FP + 2 VecInt) and 3-4 MEM.
Not counting the 256-bit AVX instructions.
So in some, if not most, cases AMD's SMT can even beat Intel's...
Well, if Blender was the best case, as I think, there'll be only a few cases where it can happen, and by a very small margin.

Which isn't a good result.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
It's not 40% faster, it's 40% more IPC; that's throughput, not speed. It will only be 40% faster if you actually find software that is able to use all 10 instructions the Zen core has available per cycle.
Which will be pretty difficult, since there aren't many CPUs out there (if there are any) that issue 10 instructions per cycle; I guess that's why they went with Blender instead of some "traditional" benchmark.
I think we were past this point already. But 40% could also be reached by comparing Zen with a CPU which only achieves a 29% lower throughput, as you say.

This assumption of the need for a constant 100% utilization also leads to XV already achieving ~100% on 7 FUs for 1T. This would also mean that 40% can't be the average, as it is already the max value.

Thus you are implicitly stating that XV should have 2-7x the IPC of SKL, based on SKL real-world measurements.

We had this before. Please verify your assumptions.
 
Mar 10, 2006
11,715
2,012
126
Did you look at the outliers which made up the score?

The most significant of which (from the first A12 set):

HTML5 DOM:

Bristol Ridge: 932
Haswell-E: 3908

≈ 4.2x the score, i.e. a ~320% advantage

Then there's the inclusion of Memory performance in the test

I'm not saying all workloads should show a consistent performance delta between architectures (they never do), but you do have to be able to recognise an outlier so extreme that it's capable of skewing an IPC comparison by 20-30%.

This, and the inclusion of things like AES (which works back in Excavator's favor), are reasons why Geekbench is not very useful for architecture comparisons.

GB4 score is calculated using geometric mean.

https://en.m.wikipedia.org/wiki/Geometric_mean
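To make the scoring discussion concrete, here is a small sketch of how a geometric-mean composite reacts to one extreme outlier. The subscores are purely illustrative, not real GB4 numbers:

```python
from math import prod

def geometric_mean(scores):
    # Geometric mean: the n-th root of the product of n scores
    return prod(scores) ** (1.0 / len(scores))

# Hypothetical subscores: four even results vs. one ~4.2x outlier
even   = [1000, 1000, 1000, 1000]
skewed = [1000, 1000, 1000, 4200]

print(geometric_mean(even))    # ≈ 1000
print(geometric_mean(skewed))  # ≈ 1432: one outlier still lifts the composite ~43%
```

The geometric mean damps outliers compared to an arithmetic mean (which would give 1800 here), but as the thread points out, a 4x outlier can still move the composite by tens of percent.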
 
Mar 10, 2006
11,715
2,012
126
I thought we had this discussion already... it's likely due to the 9800's lack of L3. The 8350, for instance, gets around 3400 on the HTML5 DOM test. If anything, it's great because it highlights weaknesses of a chip that you might not obviously see with just one test.

We did have this discussion before, and I shot down your thesis by pointing out that the A9X performs as well per clock as the A9 in this test. The A9X has no L3$, while the A9 does.
 

jpiniero

Lifer
Oct 1, 2010
14,841
5,456
136
We did have this discussion before, and I shot down your thesis by pointing out that the A9X performs as well per clock as the A9 in this test. The A9X has no L3$, while the A9 does.

It's not that the L3 helps in general; it's that the lack of L3 hurts Bristol Ridge specifically. Those are two separate things. You'd have to come up with a good reason that Vishera is so much faster to think otherwise.
 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
I think we were past this point already. But 40% could also be reached by comparing Zen with a CPU which only achieves a 29% lower throughput, as you say.

This assumption of the need for a constant 100% utilization also leads to XV already achieving ~100% on 7 FUs for 1T. This would also mean that 40% can't be the average, as it is already the max value.

Thus you are implicitly stating that XV should have 2-7x the IPC of SKL, based on SKL real-world measurements.

We had this before. Please verify your assumptions.
Did they state that it's 40% avg?
Yes, 40% could be throughput only, it could be speed only, it could be a mix of both, or it could be whatever.
When people here state "same IPC as...", what are they referring to? Speed, or throughput, or an average over a number of cases?
What do you call the speed if you have one thread that only needs one instruction per cycle? That's where Intel is so much better; it has a lot of things that improve cycles per instruction.

Too many unknowns...
 
Mar 10, 2006
11,715
2,012
126
Did they state that it's 40% avg?
Yes, 40% could be throughput only, it could be speed only, it could be a mix of both, or it could be whatever.
When people here state "same IPC as...", what are they referring to? Speed, or throughput, or an average over a number of cases?
What do you call the speed if you have one thread that only needs one instruction per cycle? That's where Intel is so much better; it has a lot of things that improve cycles per instruction.

Too many unknowns...

IPC is commonly agreed on to mean "single threaded performance per clock."
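Under that working definition, IPC is simply retired instructions divided by core clock cycles for a single-threaded run. A trivial sketch, with hypothetical performance-counter readings:

```python
def ipc(instructions_retired, core_cycles):
    # Instructions per clock for one thread, from performance-counter readings
    return instructions_retired / core_cycles

# Hypothetical readings from a 1-thread benchmark run
print(ipc(1_200_000_000, 1_000_000_000))  # 1.2
```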
 

bjt2

Senior member
Sep 11, 2016
784
180
86
You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

6 uops/cycle from the ucache is enough for two heavy-duty 3-IPC threads... And 10 units are enough for the peak and to keep a steady 6 uops/cycle execution flow (on optimized code and without cache misses, obviously).

Intel processors can dispatch 8 of them, and the scheduler works under the best conditions (it has complete "vision" and can make the best decisions).

http://www.agner.org/optimize/blog/read.php?i=415 According to Agner, 4 fused uops without fusion limits, so it is correct that the limit is 8 uops, but rarely are all of these fused ops present...

Last but not least, the decoder can decode up to 4 instructions, and this is one of the most important bottlenecks for ST code.

Yes, but we know how dense AMD's MOPs are, up to 3 uops (or an RMW instruction), and from the uop cache, if the hit rate is high enough, we obtain 6 uops/cycle.

Not counting the 256-bit AVX instructions.

In 256-bit it is correct that Intel has the same (FMUL or FADD) or double (FMAC) the throughput of Zen, because it can do 2x256-bit FMAC and 2x256-bit VecInt, but in that case no int operations, because all 4 ports are occupied, while Zen can do 4 int ops in parallel. This is what I said, without specifying bits. But Intel can do 1 FMUL + 1 FADD, or 1 FMAC + 1 FADD, or 1 FMAC + 1 FMUL, or 1 FMAC; and for legacy x87 or 128-bit code Zen's throughput is equal or superior: 2 FMUL + 2 FADD, or 1 FMAC + 1 FMUL + 1 FADD, or 2 FMAC.

Well, if Blender was the best case, as I think, there'll be only a few cases where it can happen, and by a very small margin.

Which isn't a good result.

I don't think that it's an outlier... Surely it's the best case, but looking at the decoders, uop cache, cache bandwidth, and number of pipelines, I don't think its IPC will be so low...

Regarding SMT... At 128-bit, certainly the extra ports help to achieve higher MT IPC...
 

cdimauro

Member
Sep 14, 2016
163
14
61
IPC is commonly agreed on to mean "single threaded performance per clock."
I don't subscribe to it, as we already discussed here around a month ago.
6 uops/cycle from the ucache is enough for two heavy-duty 3-IPC threads... And 10 units are enough for the peak and to keep a steady 6 uops/cycle execution flow (on optimized code and without cache misses, obviously).
And according to the Blender results, Broadwell(-E)'s resources guarantee almost exactly the same result.
According to Agner, 4 fused uops without fusion limits, so it is correct that the limit is 8 uops, but rarely are all of these fused ops present...
Sure, but Intel's scheduler has a better chance to feed its ports, since it has an exact "vision" of the whole machine state at any precise moment.
Yes, but we know how dense AMD's MOPs are, up to 3 uops (or an RMW instruction), and from the uop cache, if the hit rate is high enough, we obtain 6 uops/cycle.
Intel's uops can also carry a lot of "actions/operations". That's why you see similar results in Blender, and that's why it has shown much better ST performance.

I'm pretty curious to see how well Zen can perform with heavy ST code, like emulators, compilers, databases, etc.
In 256-bit it is correct that Intel has the same (FMUL or FADD) or double (FMAC) the throughput of Zen, because it can do 2x256-bit FMAC and 2x256-bit VecInt, but in that case no int operations, because all 4 ports are occupied, while Zen can do 4 int ops in parallel. This is what I said, without specifying bits. But Intel can do 1 FMUL + 1 FADD, or 1 FMAC + 1 FADD, or 1 FMAC + 1 FMUL, or 1 FMAC; and for legacy x87 or 128-bit code Zen's throughput is equal or superior: 2 FMUL + 2 FADD, or 1 FMAC + 1 FMUL + 1 FADD, or 2 FMAC.
FMAC is emulated by FMAC + FMADD in Zen, as you know. That's why it can't reach Intel performance in this case.
Intel also has two symmetrical FPU units that can do any FMAC/FMUL/FADD. So it can do 2x256-bit FMACs, or FMULs, or FADDs, whereas Zen is limited to 1 of each (per type).

It can also do 2 VALU or VShift (on the same port), and there's another one for VALU or VShuffle.

On Zen we don't know what the situation is, but if it follows a design similar to Jaguar's, you only have 2x128-bit = 1x256-bit VShift or "conversion", and 4x128-bit = 2x256-bit VALU; I don't know about the shuffle operations.

And you are wrong about the integer units: even with all 3 Int/FPU ports busy, there's always one which is free for "integer" (ALU, Shift, Branch) operations.

Plus 2x256-bit loads, 1x256-bit store, and one AGU unit. Whereas here Zen has only 2 ports (the 2 AGUs) where you can submit memory operations.
I don't think that it's an outlier... Surely it's the best case, but looking at the decoders, uop cache, cache bandwidth, number of pipeline, i don't think its IPC will be so low...
I haven't said that it's low, but I think it's likely that it'll be lower than its Intel counterparts'.

IF (and only if) AMD's statement is correct, 40% more IPC compared to XV can give an estimate.
Regarding SMT... At 128 bit certainly the higher ports helps to have higher MT IPC...
We have already seen Blender results...
 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
Well, it bugs me, what can I say; it gives any company carte blanche to use technical terms to mislead its customers without lying.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Well, it bugs me, what can I say; it gives any company carte blanche to use technical terms to mislead its customers without lying.
I have no evidence that they did, in both cases. Do you? Is there a ruling in the case about the PD cores yet?

Problems could arise if there is room for interpretation. Anyone who has worked in areas like science or requirements engineering will know what I mean.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
And you are wrong about the integer units: even with all 3 Int/FPU ports busy, there's always one which is free for "integer" (ALU, Shift, Branch) operations.

That's because the FP ports are three, of which two can do FP and all three can do VecInt, if I remember correctly... I thought that the fourth port could also do VecInt ops, my bad... That would not have been so bad, because the scheduler could have given high priority to the int ops... But if the fourth is free of FP and VecInt, it's better for the scheduler (simpler) and bad only for VecInt performance...

IF (and only if) AMD's statement is correct, 40% more IPC compared to XV can give an estimate.

I think that 40% is a minimum value... In this thread it was shown that the lack of L3 in XV hurts performance a lot.
The differences in Blender are closer to 70-80% than 40%, and now Zen is on par... So the maximum gain is more in the 70-80% ballpark than 40%... Moreover, the frontend (uop cache and decoders), backend (queues, stack engine, 2x retire rate) and caches (L1, L2 and L3) are improved a lot from XV, so I am not surprised by this huge gain...
 

coercitiv

Diamond Member
Jan 24, 2014
6,400
12,858
136
I don't subscribe to it, as we already discussed here around a month ago.
Yes, and when pressed to discuss how that would apply to CPUs with the same microarchitecture but different core counts, you began repeating the definition of IPC, which states it is application-specific, hence dependent on the number of threads said app can use. Once we go down that strict road we can only mention IPC in relation to a specific app, which for discussions on this forum is a hindrance.

That is why, whether you like it or not, people have commonly agreed to equate IPC with "single-threaded performance per clock", while using the term "throughput" to indicate multi-threaded scenarios, and only "performance" when clocks are also taken into consideration. It's turned out to be quite a productive arrangement, helping people get their point across, but by all means, go ahead and try to change that for your own comfort!
 

imported_ats

Senior member
Mar 21, 2008
422
63
86
You have to see how many uops can be dispatched per clock cycle by the micro-op cache: this is the bottleneck.

Intel processors can dispatch 8 of them, and the scheduler works under the best conditions (it has complete "vision" and can make the best decisions).

Last but not least, the decoder can decode up to 4 instructions, and this is one of the most important bottlenecks for ST code.

No, no it is not. For ST code, branches and loads are the most important bottlenecks. Neither of those really depends on instruction decoding or scheduling. Most code is extremely fortunate if it gets even close to 2 IPC.
 

cdimauro

Member
Sep 14, 2016
163
14
61
That's because the FP ports are three, of which two can do FP and all three can do VecInt, if I remember correctly... I thought that the fourth port could also do VecInt ops, my bad... That would not have been so bad, because the scheduler could have given high priority to the int ops... But if the fourth is free of FP and VecInt, it's better for the scheduler (simpler) and bad only for VecInt performance...
3 256-bit VecInt ports should be enough.
I think that 40% is a minimum value... In this thread it was shown that the lack of L3 in XV hurts performance a lot.
The differences in Blender are closer to 70-80% than 40%, and now Zen is on par... So the maximum gain is more in the 70-80% ballpark than 40%... Moreover, the frontend (uop cache and decoders), backend (queues, stack engine, 2x retire rate) and caches (L1, L2 and L3) are improved a lot from XV, so I am not surprised by this huge gain...
It's AMD that stated the +40% IPC over XV, not me. Don't you trust what AMD reported?
Yes, and when pressed to discuss how that would apply to CPUs with the same microarchitecture but different core counts, you began repeating the definition of IPC, which states it is application-specific, hence dependent on the number of threads said app can use. Once we go down that strict road we can only mention IPC in relation to a specific app, which for discussions on this forum is a hindrance.
Guess what: you run/test specific applications to get some IPC number.
That is why, whether you like it or not, people have commonly agreed to equate IPC with "single-threaded performance per clock", while using the term "throughput" to indicate multi-threaded scenarios, and only "performance" when clocks are also taken into consideration. It's turned out to be quite a productive arrangement, helping people get their point across, but by all means, go ahead and try to change that for your own comfort!
Well, nothing stops you from using other metrics to measure something else.

If the IPC definition isn't good, you can define a new metric. Everything depends on what you want to measure.
No, no it is not. For ST code, branches and loads are the most important bottlenecks. Neither of those really depends on instruction decoding or scheduling. Most code is extremely fortunate if it gets even close to 2 IPC.
I only wanted to stress the fact that the decoder itself puts a limit on the number of instructions that can be executed.

To be clearer: you can have tens of ports/execution units, but in ST code you have this upper limit in the best cases, and thus on the number of ports that can usefully be employed.

Which is not the common/normal case. As you reported, having many branches (and chains of dependent instructions, especially ones based on loads) hurts the IPC... and can stress the decoders a lot. If you have to frequently change the executed code, having queues for micro-ops can be of little or no help here, and the decoder only guarantees a maximum of 4 instructions per cycle to feed the backend.

That's why I want to see some benchmarks with emulators, etc..
 

bjt2

Senior member
Sep 11, 2016
784
180
86
It's AMD that stated the +40% IPC over XV, not me. Don't you trust what AMD reported?

AMD didn't say whether the 40% was a minimum, mean, or maximum, but looking at Blender, the maximum is at least +80%. This implies that 40% is at least the mean value, which is reasonable. Stating that 40% is the minimum in an official statement would be very dangerous, because a single example with less than 40% gain would suffice to prove it wrong.
Thinking more about this subject, I think (and it can probably be verified by looking at the small print) that AMD intended the mean IPC, as is reasonable.

To be clearer: you can have tens of ports/execution units, but in ST code you have this upper limit in the best cases, and thus on the number of ports that can usefully be employed.

Which is not the common/normal case. As you reported, having many branches (and chains of dependent instructions, especially ones based on loads) hurts the IPC... and can stress the decoders a lot. If you have to frequently change the executed code, having queues for micro-ops can be of little or no help here, and the decoder only guarantees a maximum of 4 instructions per cycle to feed the backend.

That's why I want to see some benchmarks with emulators, etc..

In the case of Zen you can safely assume a weighted mean between 4 and 6 uops/cycle, weighted by the uop cache hit/miss ratio. Because if there is a branch misprediction, the uops can still be taken from the uop cache, so if there is a hit, the throughput is still 6 uops/cycle.
You don't need the decoder again, unless the branch is in a never-taken piece of code... But we are talking about loops here, right? Because a single path without loops has a negligible execution time compared to loops of code, and so in the total performance evaluation it can be safely ignored. We are talking about programs that take at least seconds to execute, so pieces of code executed only once, and the first execution of a loop, where the uop cache would probably miss, are a negligible fraction of the total execution time...
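The weighted mean described above can be sketched as follows. The 4 inst/cycle and 6 uops/cycle figures are from the posts; the hit ratio is an assumed illustrative value, not a measured one:

```python
def effective_frontend_rate(decoder_rate, ucache_rate, ucache_hit_ratio):
    # Average uops/cycle delivered by the front end, weighted by how often
    # fetch hits the uop cache vs. falls back to the legacy decoders
    return ucache_hit_ratio * ucache_rate + (1 - ucache_hit_ratio) * decoder_rate

# Zen figures from the post: 4 inst/cycle decode, 6 uops/cycle from the uop cache
print(effective_frontend_rate(4, 6, 0.8))  # ≈ 5.6 at an assumed 80% hit rate
```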
 

FIVR

Diamond Member
Jun 1, 2016
3,753
911
106
Geez, 74 pages and 73 of them are special friends TheELF and cdimauro arguing about what "40%" means, as if that has any bearing on actual Zen performance whatsoever. Can we give these trolls their own thread to fill with this drivel? It's pedantic, totally unrelated to technology, and it's also threadcrapping at this point.
 

KTE

Senior member
May 26, 2016
478
130
76
I expected 40% to be the minimum actually, and I have seen it quoted somewhere here, but when searching I could not find a direct reference, except:



First point suggests pretty clearly that 40% is with the SMT. Otherwise why list it under how the 40% IPC is gained?

Going off HotChips, I'd say it's IPC pretty clearly, and that means per core. Not 40% performance.

Sent from HTC 10
(Opinions are own)
 

inf64

Diamond Member
Mar 11, 2011
3,764
4,223
136
I expected 40% to be the minimum actually, and I have seen it quoted somewhere here, but when searching I could not find a direct reference, except:



First point suggests pretty clearly that 40% is with the SMT. Otherwise why list it under how the 40% IPC is gained?

Going off HotChips, I'd say it's IPC pretty clearly, and that means per core. Not 40% performance.

Sent from HTC 10
(Opinions are own)

40% is ST IPC and was confirmed by AMD fellow at the HC QnA session:
Question: Did 40% uplift on Zen IPC include boost from SMT? AMD: No, that was just 1-thread improvement. #HotChips
https://mobile.twitter.com/Daniel_Bowers/status/768270633125806081

I think it is clear that 40% is just an average number they got from a mix of ST workloads (mostly int). SMT could bring 20-30% more performance when 2 threads run on one core, meaning thread for thread Zen could be even 80%+ better in MT workloads.
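A quick sanity check on that arithmetic. The 1.4 factor is AMD's stated claim; the SMT factor is the assumed upper end of the 20-30% range quoted above:

```python
st_ipc_gain = 1.40   # AMD's stated +40% ST IPC over Excavator
smt_scaling = 1.30   # assumed upper end of the 20-30% SMT uplift

# Compound per-core gain when both apply in an MT workload
combined = st_ipc_gain * smt_scaling
print(round((combined - 1) * 100))  # 82 -> consistent with "80%+"
```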
 

majord

Senior member
Jul 26, 2015
444
533
136
GB4 score is calculated using geometric mean.

https://en.m.wikipedia.org/wiki/Geometric_mean

Thanks, found the page explaining their scores. Since the scores are relative to a baseline, it makes sense to use a geometric mean.

That does not invalidate the point, though; it will still skew results significantly.

Anyway, I did some testing, and also threw in a Sandy Bridge i3 result. An i3 for two reasons: no turbo to mess with the results, and my Skylake system is an i3.

I should say, I have no problem with these tests "skewing" the end result, provided Geekbench is seen as one application amongst a wider test suite. After all, this is how "non-synthetic" applications end up with different performance across architectures: certain code sections may run an order of magnitude slower on a given architecture, resulting in a larger than expected overall performance delta. The problem is when using GB to summarise performance/IPC, full stop.


 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Anyway, I did some testing, and also threw in a Sandy Bridge i3 result. An i3 for two reasons: no turbo to mess with the results, and my Skylake system is an i3.
That difference between Sandy Bridge and Skylake, though.
 

majord

Senior member
Jul 26, 2015
444
533
136
Yeah, I was thinking of throwing in the SKL vs SNB comparison column... for the record, SKL is 1.25x (25% faster than) SNB with AES removed.
 