AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Page 35
Status
Not open for further replies.

cdimauro

Member
Sep 14, 2016
163
14
61
It's not comparable, because x86 is different. x86 needs a lot of complex logic to translate its complex instruction set into a simpler one, and thus consumes a lot more than any other ISA.
See my other comment on this above.
As you can see, an x86 core is quite big when compared to ARM.
Not always true. Compare Apple's A9X with Intel's Core-M Skylake: both are dual cores, but the latter is a bit smaller (some sites have published die shots).
Discussing the frequency of an architecture is not difficult, but it becomes quite unrealistic when it comes to a specific product. The 32-core Zen ES starting from 1.4GHz according to Geekbench might be a hint, but there's nothing about its TDP.
Geekbench is a synthetic benchmark which means nothing.

It's better to use real-world applicationS (plural: NOT only some, or even one).
 

cdimauro

Member
Sep 14, 2016
163
14
61
As I stated above, A9 has six decoders and no uop cache. This should more than compensate the x86 complexity.
I don't think so: take a look at my previous message on this topic.

Anyway, supposing that what you said is true: how did you measure such "compensation"? Have you made some calculations? Can you show them, so we can check how you got the numbers and verify them?
Moreover the A9 pipeline length is 16 stages versus 19 for Zen. So Zen has a lower FO4 delay
Can you prove this relationship? To recap: do more pipeline stages mean a lower FO4? How?
 

cdimauro

Member
Sep 14, 2016
163
14
61
MIPS are MIPS and FLOPS are FLOPS. If you are talking about decoders, Zen has uop cache and decoders will be gated 50-90% of the time...
Moreover I underestimated all positive factors...
Well, previously you have written this:
"Zen has the uop cache, that reduces the consumption by the cache hit rate (I think between 50% and 80%)"
And now you have increased the second number by 10%. So, where did you get such numbers, and why are you changing them from time to time?
- Assumed that Apple can do a custom design and that the A9X is not an ASIC, which is not certain...
ASICs have FO4>=30, custom designs have FO4<=25... At least 20% more clock at ISO power for Zen...
Even admitting A9X is a custom design,
It's highly probable that it is custom, since in the past Apple acquired not only PA-Semi (which produced power-efficient PowerPC SoCs, mostly used for mission critical projects), but also Intrinsity, which was well known for its custom logic designs (see also the Titan project from AMCC, which largely used Intrinsity's technology / expertise).
it has 16 stages versus 19, so in any case Zen can have clock 20% higher at the same Vcore.
Can you also prove that you (always) get a 20% clock increase on a chip, jumping from 16 to 19 stages?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I don't think so: take a look at my previous message on this topic.

What is the hit rate of a uop cache? I posted 50%-80% because these were the estimates I found on other sites... Obviously it will be 50% if it's small and 80-90% if it's big...

Anyway, supposing that what you said is true: how did you measure such "compensation"? Have you made some calculations? Can you show them, so we can check how you got the numbers and verify them?

Can you prove this relationship? To recap: do more pipeline stages mean a lower FO4? How?

I already did this calculation some posts ago, but briefly:
Let's assume decoder power is no more than 1/3 of total TDP (otherwise using x86 would be uneconomical).
With an 80% hit rate, even if the x86 decoders draw double the power of ARM decoders, the penalty is 20% * 1/3 * 2 = 13% of the total TDP.
I assumed that the 2 ARM cores in the A9X do not draw all 5W, because there are also the GPU, NB and SB. This can more than compensate for the +13% given by the x86 penalty.
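The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate as stated in the post: the function name and parameters are made up for illustration, and the inputs (80% hit rate, decoders at 1/3 of TDP, x86 decoders at 2x ARM) are the post's assumptions, not measurements.

```python
def decoder_penalty(hit_rate, decoder_share, x86_vs_arm_ratio):
    """Fraction of total TDP attributed to x86 decoding, using the post's formula.

    hit_rate:         assumed uop cache hit rate (decoders gated on a hit)
    decoder_share:    assumed upper bound for decoder power as a share of TDP
    x86_vs_arm_ratio: assumed power of x86 decoders relative to ARM decoders
    """
    miss_rate = 1.0 - hit_rate
    return miss_rate * decoder_share * x86_vs_arm_ratio


# The numbers used in the post: 20% * 1/3 * 2
print(f"{decoder_penalty(0.80, 1 / 3, 2.0):.1%}")  # 13.3%

# Same formula with the pessimistic end of the 50%-80% hit rate range
print(f"{decoder_penalty(0.50, 1 / 3, 2.0):.1%}")  # 33.3%
```

The result is very sensitive to the hit rate, which is why the 50% versus 80% question matters for the estimate.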
Regarding FO4: we know that Bulldozer has at most a 20-stage pipeline: this is the maximum misprediction penalty... The articles on the internet range from 15 to 20, also because no paper has stated this precisely...
For Zen we know that it has a 19-stage pipeline, and the A9X 16.
Most ARM designs have FO4>=30, because they are ASICs, and anyway reach 2.5GHz in 4+4 big.LITTLE configurations (I recall Snapdragon 8x0, the top model at least), versus 1.85x2 for the A9 and 2.26x2 for the A9X.
Anyway, Zen's pipelines seem to be similar to BD's pipelines, which have a low FO4, estimated to be 17, giving the high clocks, up to 4.2GHz on 28nm BULK.
All these data make me think that Zen's FO4 is similar to Bulldozer's and lower than the A9X's...
But even if Zen has the same FO4 as the A9X, the whole point of the calculations was to infer the feasibility of a 32c Zen at 2GHz and 95W, and given these data, I think that yes, it is feasible...
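To make the FO4 reasoning concrete, here is a rough sketch of the relationship being used. It assumes the cycle time is simply (FO4 delays per stage) x (delay of one FO4 inverter on the process), ignoring latch overhead, clock skew and wires. The function names are hypothetical, and the inputs are the estimates quoted above (FO4 of 17 and 4.2GHz for Bulldozer on 28nm, FO4 of 30 for an ASIC-style design), not measured data.

```python
def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def fo4_delay_ps(freq_ghz, fo4_per_stage):
    """Implied delay of one FO4 inverter, if a cycle is fo4_per_stage FO4 delays."""
    return cycle_time_ps(freq_ghz) / fo4_per_stage


def max_clock_ghz(fo4_per_stage, fo4_delay):
    """Clock reachable with a given logic depth per stage on the same process."""
    return 1000.0 / (fo4_per_stage * fo4_delay)


# Estimates from the post: Bulldozer at 4.2GHz with about 17 FO4 per stage
bd_fo4_delay = fo4_delay_ps(4.2, 17)
print(f"implied FO4 delay: {bd_fo4_delay:.1f} ps")

# A design with 30 FO4 per stage on the same process and voltage
print(f"30 FO4 design: {max_clock_ghz(30, bd_fo4_delay):.2f} GHz")
```

Note that this model ties the clock to logic depth per stage, not to the stage count itself: a longer pipeline only gives a higher clock if the extra stages were actually used to reduce the FO4 per stage.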
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
It is not the delay of an inverter but the delay of a chain of three inverters, with each inverter output being loaded by 4 inverter inputs.

To explain it simply: an inverter drives four inverters; we take one of those 4 inverters and load its output with 4 inverters, so there are 3 stages. In the pic below there are two stages, so we add 4 gates that are driven by, say, the upper gate; the delay at the output of the third consecutive gate is the FO4 delay.

Nope. The "delay of an inverter" is closer to being correct. FO4 just describes the driving/loading condition of the inverter getting measured.

You need a realistic driver because input slope affects device delay, so the driver of the inverter should be sized 1/4th the size of the inverter under test.
You need a realistic load because an unloaded inverter (self-loading only) is faster, so you put in 4 inverters of equal size as the load.
And then usually, for good measure, you want to put FO4 loading on the load inverter to deal with the Miller cap issues on an unloaded inverter.
 
Reactions: KTE

bjt2

Senior member
Sep 11, 2016
784
180
86
Well, previously you have written this:
"Zen has the uop cache, that reduces the consumption by the cache hit rate (I think between 50% and 80%)"
And now you have increased the second number by 10%. So, where did you get such numbers, and why are you changing them from time to time?

It's highly probable that it is custom, since in the past Apple acquired not only PA-Semi (which produced power-efficient PowerPC SoCs, mostly used for mission critical projects), but also Intrinsity, which was well known for its custom logic designs (see also the Titan project from AMCC, which largely used Intrinsity's technology / expertise).

Can you also prove that you (always) get a 20% clock increase on a chip, jumping from 16 to 19 stages?

Mine were rough estimations. Indeed in my last post I used 80% to stay safer.
You are right that 16 to 19 stages does not imply +20% clock, but as I said in the previous post, even if the FO4 is the same, the point was to infer the feasibility of a 32c Zen at 2GHz and 95W...

But since BD is 20 stages, we can deduce that Zen has a similar FO4, since both are x86 designs from the same manufacturer...
If the A9X has a low FO4, why does it not reach at least low Bulldozer clocks? Excavator on 28nm BULK draws 5W at 2.4GHz for two cores... The A9X is on 14nmFF and draws 5W at 2.26GHz, and you are all saying that x86 is more complex than ARM and draws more power... So either Zen and BD have a lower FO4 than the A9X, or the x86 TDP penalty is not that much...
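The comparison above can be normalized to watts per core per GHz. This is a back-of-the-envelope sketch using only the figures quoted in the post. The helper name is hypothetical, dividing linearly by frequency is a simplification (power does not scale linearly with clock once voltage changes), and the A9X's 5W is assumed here to be all CPU, although it also covers the GPU, NB and SB.

```python
def watts_per_core_ghz(total_watts, cores, freq_ghz):
    """Crude power figure normalized by core count and clock."""
    return total_watts / (cores * freq_ghz)


# Figures quoted in the post (not measurements)
excavator = watts_per_core_ghz(5.0, 2, 2.4)   # 28nm bulk, two cores at 2.4GHz
a9x = watts_per_core_ghz(5.0, 2, 2.26)        # 14nm FF, two cores at 2.26GHz

print(f"Excavator: {excavator:.2f} W per core per GHz")  # 1.04
print(f"A9X:       {a9x:.2f} W per core per GHz")        # 1.11
```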
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,912
3,524
136
Is there anything to show Bulldozer is 20 stages? I used to think it was that long, but now I can only find comments from AMD saying it's somewhere around 15/16. It could be 20ish for floating point.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Is there anything to show Bulldozer is 20 stages? I used to think it was that long, but now I can only find comments from AMD saying it's somewhere around 15/16. It could be 20ish for floating point.

But for Zen, Dresdenboy posted a paper that states that the Zen INT pipeline is 19 stages. This is SURE.
Since BD reaches 4.2GHz on the shitty 28nm BULK and 5GHz on the 32nm SOI, I am inclined to believe that BD has a low FO4 and thus many stages... 20 stages is not much more than Skylake, eh!

If Bulldozer is not 20 stages, this is GREAT news... Because Skylake is about 17 stages...
 

cdimauro

Member
Sep 14, 2016
163
14
61
What is the hit rate of a uop cache? I posted 50%-80% because these were the estimates I found on other sites... Obviously it will be 50% if it's small and 80-90% if it's big...
It depends on the specific implementation.

Intel claimed about 80% hit ratio for its LSD (on the first implementation).
I already did this calculation some posts ago, but briefly:
Let's assume decoder power is no more than 1/3 of total TDP (otherwise using x86 would be uneconomical).
With an 80% hit rate, even if the x86 decoders draw double the power of ARM decoders, the penalty is 20% * 1/3 * 2 = 13% of the total TDP.
I assumed that the 2 ARM cores in the A9X do not draw all 5W, because there are also the GPU, NB and SB. This can more than compensate for the +13% given by the x86 penalty.
Regarding FO4: we know that Bulldozer has at most a 20-stage pipeline: this is the maximum misprediction penalty... The articles on the internet range from 15 to 20, also because no paper has stated this precisely...
For Zen we know that it has a 19-stage pipeline, and the A9X 16.
Most ARM designs have FO4>=30, because they are ASICs, and anyway reach 2.5GHz in 4+4 big.LITTLE configurations (I recall Snapdragon 8x0, the top model at least), versus 1.85x2 for the A9 and 2.26x2 for the A9X.
Anyway, Zen's pipelines seem to be similar to BD's pipelines, which have a low FO4, estimated to be 17, giving the high clocks, up to 4.2GHz on 28nm BULK.
All these data make me think that Zen's FO4 is similar to Bulldozer's and lower than the A9X's...
But even if Zen has the same FO4 as the A9X, the whole point of the calculations was to infer the feasibility of a 32c Zen at 2GHz and 95W, and given these data, I think that yes, it is feasible...
I already stated before (and also on the Italian forum which we frequent) that ARM and x86 are too different. Aside from this, there are too many assumptions / ifs. For those reasons I don't think that such comparisons, and speculations too, make sense.

Anyway, I asked a different question before, regarding your previous statement. Should I assume that there's no proof for it?
Mine were rough estimations. Indeed in my last post I used 80% to stay safer.
You are right that 16 to 19 stages does not imply +20% clock,
OK, so there's no law/theorem behind it: just speculation.
but as I said in the previous post, even if the FO4 is the same, the point was to infer the feasibility of a 32c Zen at 2GHz and 95W...
But you're comparing ARMs and x86s, which are too different both in ISA and implementations: I don't subscribe to it. It's already difficult to compare microimplementations of the same ISA...
But since BD is 20 stages, we can deduce that Zen has a similar FO4, since both are x86 designs from the same manufacturer...
If the A9X has a low FO4, why does it not reach at least low Bulldozer clocks? Excavator on 28nm BULK draws 5W at 2.4GHz for two cores... The A9X is on 14nmFF and draws 5W at 2.26GHz, and you are all saying that x86 is more complex than ARM and draws more power... So either Zen and BD have a lower FO4 than the A9X, or the x86 TDP penalty is not that much...
See above, but, well, I'll add another thing here: did you take a look at the ISAs of ARM and x86? x86 does more "useful work" in several instructions (and there are MANY of them), but it pays for it in terms of power consumption (due to the more complex ALUs and FPUs).

But it can make sense IF the code can make use of such complexity. Which isn't always the case, although x86s have shown very good and extremely competitive performance compared to the most "noble" RISC processors.

So, again, there are too many differences, which IMO make it impossible to compare the two ISAs and their implementations.
 

Abwx

Lifer
Apr 2, 2011
11,517
4,303
136
Nope. The "delay of an inverter" is closer to being correct. FO4 just describes the driving/loading condition of the inverter getting measured.

You need a realistic driver because input slope affects device delay and so the driver of the inverter should be sized 1/4th the size of the inverter under test.
You need a realistic load because an unloaded inverter (self-loading only) is faster, so you put in 4 inverters of equal size as the load.
And then usually, for good measure, you want to put FO4 loading on the load inverter to deal with the Miller cap issues on an unloaded inverter.

And to get a realistic measure you need to chain several gates...

The first gate on the path, although subject to a loading effect due to being loaded by four gates, will nevertheless be somewhat immune to the Miller effect due to the low output impedance of the generator; as such it will be abnormally fast. Hence the need to buffer with successive gates, so that the third stage is driven by a device that is fully subject to the Miller effect while itself being driven by a device that presents a realistic slew rate.

Generally the test is done with a ring oscillator with more than 4 gates in series. Such a ring oscillator, which consists of a few handfuls of transistors, is implemented on any silicon, as it allows measuring the characteristics of successive wafers and gradually improving the process.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
It depends on the specific implementation.

Intel claimed about 80% hit ratio for its LSD (on the first implementation).

I already stated before (and also on the Italian forum which we frequent) that ARM and x86 are too different. Aside from this, there are too many assumptions / ifs. For those reasons I don't think that such comparisons, and speculations too, make sense.

Anyway, I asked a different question before, regarding your previous statement. Should I assume that there's no proof for it?

OK, so there's no law/theorem behind it: just speculation.

But you're comparing ARMs and x86s, which are too different both in ISA and implementations: I don't subscribe to it. It's already difficult to compare microimplementations of the same ISA...

See above, but, well, I'll add another thing here: did you take a look at the ISAs of ARM and x86? x86 does more "useful work" in several instructions (and there are MANY of them), but it pays for it in terms of power consumption (due to the more complex ALUs and FPUs).

But it can make sense IF the code can make use of such complexity. Which isn't always the case, although x86s have shown very good and extremely competitive performance compared to the most "noble" RISC processors.

So, again, there are too many differences, which IMO make it impossible to compare the two ISAs and their implementations.

It's correct that I don't have proof of the correctness of my statements, but they included a high margin of overestimation... 2GHz vs 2.26, same FO4 (which is not the case), 5W all used by the CPU (discarding GPU, NB and SB)... So many safety margins that I doubt I am very far from the truth... Anyway you are right: technically I don't have any proof of my statements. But you should admit that they are at least reasonable, if not probable...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Don't ARM CPUs have decoders, ALUs, shifters, FPUs, FADD, FMUL, schedulers, retire units, etc.? Can't they do exactly the same things as x86? They are functionally equivalent, because I can write a C++ program, compile it for ARM and x86, and both perform exactly the same calculation. Even a Turing machine can do this. The problem is the speed and the energy consumed. There are multiplatform benchmarks, like the ones that measure browser speed, and wattmeters to measure power. You can even use the same browser, e.g. Chrome, recompiled for Android, Windows, Linux, and OS X. You can even pick an old PowerPC Mac and do the same measurement.
Intel Xeon CPUs, for instance, were benched against POWER8 with the same high-end benchmarks and won both in speed and in power consumption... And Atom has always beaten almost every ARM architecture (except, maybe, Apple's) in performance, with similar power consumption...

I know they are different. Even a Ferrari and a Porsche are different. But both can be used to go from A to B. With different speed and consumption...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
True. And now?
One can use a set of benchmarks, even real-life applications, and a power meter, and decide which is the best.

My point is that even if the ISA is different, if the functional units are similar (e.g. same number, same bit width), the power consumption can't be too different, provided that the FO4 is the same.

Apple used some standard library to design its ALU and FPU. If they perform x calculations per second, the power consumption is the same, provided that the FO4 is the same. A FADD is a FADD. We can implement it with various FO4 values depending on the CPU's target speed.

I was trying to estimate the power consumption of Zen at 2GHz. Provided that the unit count is similar, and provided that the process is the same, and so the library, if the FO4 is the same, then the CPU power consumption should be similar. But if Zen's FO4 is lower, then the consumption AT 2GHz is lower...

I know that the ISAs are different, but my point was to roughly estimate the Zen TDP at about 2GHz. To give an upper limit, at least.
 

cdimauro

Member
Sep 14, 2016
163
14
61
The problem is that the functional units are completely different.

I repeat again: have you taken a look at the specific ISAs? How can you state that an ARM FPU is functionally similar to an x86 one, except from a very high-level point of view?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
What is wrong with my rough calculations? Are they too optimistic? Why? And please don't say it's because ARM and x86 are different. I know. But a piece of software that at a high level does, e.g., a convolution filter will at a low level perform almost the same operations. On similar FADD, FMUL, etc. The only differences are the encoding of the instructions, the final speed and the final consumption.

If I need 1 million FADDs and 1 million FMULs to do a calculation, and I perform this calculation on an x86 CPU and an ARM CPU implemented with the same process and library, with low-level functional units with the same FO4, clock and Vcore, then the power consumption should be roughly similar, if the 2 CPUs perform about the same in terms of time... E.g. a program that applies this filter on the A9X at 2.26 GHz and on Zen at 2.26 GHz, performing the calculation in about the same time (+-20%)...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
The problem is that the functional units are completely different.

I repeat again: have you taken a look at the specific ISAs? How can you state that an ARM FPU is functionally similar to an x86 one, except from a very high-level point of view?

A9X and Zen are implemented on the same Samsung 14nm FF process. The cell library is usually provided by the foundry. Then you can apply tweaks to get more speed or lower power. But remember that I am doing a rough calculation. A FADD is a FADD. A FMUL is a FMUL. A convolution filter is a convolution filter. It requires x FADDs and y FMULs. If the low-level units are at the same clock, Vcore and FO4, on the same process, and perform the same calculation in about the same time, why should they draw very different power? I was aiming at a rough estimation in the range of +-10-20%. Just to say that surely a 32c Zen CPU is feasible at 2GHz and under 95W... I do not want to know exact clocks and power consumption...
 

cdimauro

Member
Sep 14, 2016
163
14
61
Then you are measuring the consumption of the respective CPUs, but you cannot make assumptions about one based on the results of the other, because they are completely different.

In fact, the difference is not only in the encoding of the instructions, like you said, but in how the respective FPUs get the instructions to execute and do the real job. And, of course, if you disassemble the binaries, you'll find that a very different set of instructions is generated.
 

Abwx

Lifer
Apr 2, 2011
11,517
4,303
136
The problem is that the functional units are completely different.

Doesn't matter, to the extent that they mentioned the efficiency of the new units with respect to the older ones..



Same power consumption per cycle as Excavator; to summarize, consumption at the same frequency is comparable, but with twice the FP throughput for Zen.

At 3.5GHz an XV core consumes 8W in Cinebench R15, so 8 Zen cores should be within 65W, not counting the uncore; the latter will also be much more efficient than in XV-derived APUs, so the total figure should be about 80W.
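The estimate above works out as follows. This is a sketch with hypothetical names, using only the figures in the post (8W per core at 3.5GHz, 8 cores, about 80W total) and the post's assumption that a Zen core draws the same power per cycle as an XV core. The uncore figure is not stated in the post; it is derived here as the remainder.

```python
XV_CORE_WATTS_AT_3_5GHZ = 8.0   # per-core figure quoted for Cinebench R15
ZEN_CORES = 8
TOTAL_ESTIMATE_WATTS = 80.0     # the post's "about 80W" total


def core_power(watts_per_core, cores):
    """Power of all cores, assuming equal power per core."""
    return watts_per_core * cores


def implied_uncore(total_watts, cores_watts):
    """Budget left for the uncore once the cores are accounted for."""
    return total_watts - cores_watts


cores_watts = core_power(XV_CORE_WATTS_AT_3_5GHZ, ZEN_CORES)
print(f"cores:  {cores_watts:.0f} W")                                        # 64 W
print(f"uncore: {implied_uncore(TOTAL_ESTIMATE_WATTS, cores_watts):.0f} W")  # 16 W
```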
 

cdimauro

Member
Sep 14, 2016
163
14
61
1) Excavator isn't an ARM processor.
2) You're using AMD's slides.

I prefer to wait for a power consumption analysis from tech sites like AnandTech, to see what really happens... in the real world.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Then you are measuring the consumption of the respective CPUs, but you cannot make assumptions about one based on the results of the other, because they are completely different.

In fact, the difference is not only in the encoding of the instructions, like you said, but in how the respective FPUs get the instructions to execute and do the real job. And, of course, if you disassemble the binaries, you'll find that a very different set of instructions is generated.

Yes, but since the calculation done is the same, the useful switching is the same, and if the low-level units are the same, the power consumption of the FPU and ALU is the same. What changes is the consumption of the rest of the CPU. Even if the difference is +-50%, most of the power consumption comes from the FPU and ALU, at least 2/3. I was aiming at a rough estimation. More of an upper limit.
I know the x86 decoders are very power hungry. But probably the A9X does not have a uop cache, and probably Zen's uop cache hit rate is about 80%, which means that the decoders draw 1/5 of the power... Can this be sufficient to make their power consumption similar to that of 4-6 ARM decoders? How much power does an x86 decoder draw? 1W?!?!?! I think not...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Doesn't matter, to the extent that they mentioned the efficiency of the new units with respect to the older ones..



Same power consumption per cycle as Excavator; to summarize, consumption at the same frequency is comparable, but with twice the FP throughput for Zen.

At 3.5GHz an XV core consumes 8W in Cinebench R15, so 8 Zen cores should be within 65W, not counting the uncore; the latter will also be much more efficient than in XV-derived APUs, so the total figure should be about 80W.

Ah! At last... I didn't remember the exact power consumption of an XV core...
BTW the same consumption is reasonable even if the FP is twice the size, since XV is on 28nm BULK and Zen on 14nmFF...
 

cdimauro

Member
Sep 14, 2016
163
14
61
Yes, but since the calculation done is the same, the useful switching is the same, and if the low-level units are the same, the power consumption of the FPU and ALU is the same. What changes is the consumption of the rest of the CPU. Even if the difference is +-50%, most of the power consumption comes from the FPU and ALU, at least 2/3. I was aiming at a rough estimation. More of an upper limit.
I know the x86 decoders are very power hungry. But probably the A9X does not have a uop cache, and probably Zen's uop cache hit rate is about 80%, which means that the decoders draw 1/5 of the power... Can this be sufficient to make their power consumption similar to that of 4-6 ARM decoders? How much power does an x86 decoder draw? 1W?!?!?! I think not...
It's not only a question of decoders. The ISA ALSO matters in several other aspects, which unfortunately nobody talks about.

Take some INT or FPU instructions from the respective ISAs, and try to follow what happens in the pipeline AND in the respective unit during the whole execution cycle.

Just ONE hint: even on x64, segmentation is still active. Do you know what that means? I think not, from what you've said until now.

The real world isn't made of just FO4, silicon, and libraries. ISA AND uarchitectures MATTER too.

Another useful exercise to understand this is writing an x86 and/or an ARM emulator. Then you'll see how much work is needed by an "ALU/FPU" to achieve exactly the same operation...
 
Reactions: KTE

Abwx

Lifer
Apr 2, 2011
11,517
4,303
136
1) Excavator isn't an ARM processor.
2) You're using AMD's slides.

Who needs ARM when we have a comparison made by AMD...

And about it being an AMD slide, what exactly is the problem? Do you claim that they are lying? If so, give us some clues about your exclusive info..

So far that's the most relevant info for anyone who wants to do some estimations.

I prefer to wait for a power consumption analysis from tech sites like AnandTech, to see what really happens... in the real world.

So much for that, but then what is the point of discussing those numbers in a forum only to eventually deny them? Wait for the AT analysis in this case, and apply this logic to your arguments as well...

Ah! At last... I didn't remember the exact power consumption of an XV core...
BTW the same consumption is reasonable even if the FP is twice the size, since XV is on 28nm BULK and Zen on 14nmFF...

That's somewhat of a worst-case figure, since 8W/core is for the Athlon 845 at 3.5GHz, a frequency at which it has already lost some theoretical efficiency due to the process starting to degrade above 3GHz.

As for the consumption being reasonable, that's surely due mostly to 14nm and to a lesser extent to power management; a delta that big can only be brought by a more efficient process.
 