New Zen microarchitecture details


JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
So, bottom line is Skylake has 2x the listed throughput for the two instructions mentioned? (And that is approx. 2x Zen throughput.)

Throughput for 256-bit FMA on Zen is unknown, but it should be half of what is listed. Latency is also unknown: if 128-bit FMA is listed at 5 cycles of latency, is there any overhead for teaming up the units?
For FP division (and sqrt), Intel's advantage is even bigger. I'd expect a 3x throughput advantage in 128-bit ops (4 vs 12.5 as reported by Stilt/Canard) and probably more in 256-bit mode (8 vs at least 25+).

Their real-world importance is not big. FMA is nice for Linpack and the like, but for division/sqrt you need some specialty code, like a very tight loop of vectorized instructions, before lower throughput would hurt. Hard to imagine, if you ask me.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
3 times slower than Bulldozer and 14 times less throughput? Ouch, that doesn't seem right. Is that load-to-use latency or just execution latency?

?
Bulldozer and Piledriver have 27 cycle latency for both 128 and 256-bit VDIVPD, while Steamroller has 33 cycles.
13/15 > 27/33?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Is there a complete instruction latency & throughput listing for SKL/KBL somewhere to be found?
http://users.atw.hu/instlatx64/GenuineIntel00906E9_Kabylake_InstLatX64.txt
http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake_InstLatX64.txt

I read Goto's two articles... The second doesn't say whether the four pipelines can do 4 FADDs. I thought they were 2 FMUL and 2 FADD pipes, combined for FMAC. But according to Goto, it seems the FMUL pipe is actually a full-blown FMAC that borrows only the bus from the FADD pipe, blocking the FADD at the same time. I wonder whether the FMUL pipe can also do an FADD, since it is actually an FMAC pipe...
I covered that in my P3DNow! article, as I understood it the same way regarding FMAC and port borrowing. I even assumed back then (dunno if I pointed it out in the article) that they might have separate FMUL subunits for lower power (than full-blown FMACs). I assume FADD could only be done indirectly there, by explicitly using FMA ops.

Anyway, I found that VFMADD132PD can be both 128 and 256 bit, so I think the measured throughput was 2 because it was the 128-bit form...
See:
Then Bristol Ridge would have twice the throughput of what is listed.
Also my thought on this.

VFMADD132PD is probably operating on two doubles instead of four, so throughput is 2.
Intel's dividers are the stuff of legend. I think by Broadwell they had 10-bit-wide dividers, so their throughput is like 2-3x of what AMD has. Again, the VDIVPD results from Canard make zero sense for Zen; probably two doubles instead of four are processed here (or the DIVPD throughput is reported too low).
That's not a matter of throughput but latency, which is in line (even higher than SR). Throughput could be gained by more efficient pipelining - allowing for some overlapping of FDIV execution, or simply adding more DIV units (or both of course).

And? It is obvious from the DIV results that twice as many bits halve the throughput. The Intel results are correct for 256-bit and for Skylake.
http://instlatx64.atw.hu/

What is obviously wrong: the AMD Zen results. 128-bit results are presented and compared with 256-bit results from Skylake.
We can only ask Canard PC here or wait for some AIDA instlat dump as long as nobody else got his hands on an ES.
 

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,047
136
Strangely, in AVX2 mode the VDIVPD in Zen is almost as fast as Kaby Lake. So in AVX2 they are almost as fast; in SSE mode, DIV and SQRT are slower, but Zen has twice the pipelines... Indeed, with standard Blender (that is 128-bit, probably SSE2), Zen has similar IPC to BWE...
I find it very odd that DIV is that much slower in SSE than in AVX.
 

Jan Olšan

Senior member
Jan 12, 2017
400
689
136
The 6900K uses Intel's stock cooler.

As for Ryzen, how could it be overclocked, since it will have a higher base frequency than 3.4GHz?

And btw, its TDP is displayed by AMD in the demos, and we know that the chip is slightly overvolted, so that's yet another mistake of yours...



Nothing is overclocked, and we know it. It's just the same usual suspects repeating ad nauseam what perhaps suits their agenda. I mean, whoever read a little knows that base frequencies will be higher than 3.4, so whoever comes with the "overclocking argument" is just acknowledging that he's willfully trying to mislead the eventual unsuspecting readers...

What we know is that *retail chips* will be over 3.4 GHz (even 3.6, it seems now). That doesn't mean that they used one like that for the demo in December. If you recall, the turbo was disabled and the clock was fixed at 3.4 GHz. The best explanation for such a configuration is that AMD used an ES with a lower clock and OCed it to 3.4 GHz for the purpose of that demo. How else do you really end up with such a config? By having buggy/non-working turbo? I sure hope that is not the case.

While they seemed to have those 3.6/3.9 ES chips at CES, the New Horizon demo took place 3 weeks earlier. It is more than plausible that they used a lower-clocked ES with an overclock then.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
That's not a matter of throughput but latency, which is in line (even higher than SR). Throughput could be gained by more efficient pipelining - allowing for some overlapping of FDIV execution, or simply adding more DIV units (or both of course).

I am aware of that. But DIVPD is listed as 13 / 13, so 128-bit SSE2 division has a throughput of 13 cycles. VDIVPD is listed as having 9 cycle TP??? Stilt has claimed a 128-bit mode throughput of 12.5, which is closer to what I expect it to be (equal to or a tiny bit better than DIVPD due to instruction decode differences). So Intel has 4 cycle throughput, AMD has ~12 cycle throughput in 128-bit mode.

In VDIVPD 256bit mode, Intel has 8 cycle throughput, AMD has 14.3 cycle TP ( according to Stilt ), which is very nice if true.

Still, 256-bit versus 128-bit shows strange differences (probably because division speed is sensitive to the bits set in the operands; it is usually quite variable and hard to test). If AMD has 2 divider pipes, why are they not showing up in 128-bit division throughput?
 

bjt2

Senior member
Sep 11, 2016
784
180
86

Thanks.

I covered that in my P3DNow! article, as I understood it the same way regarding FMAC and port borrowing. I even assumed back then (dunno if I pointed it out in the article) that they might have separate FMUL subunits for lower power (than full-blown FMACs). I assume FADD could only be done indirectly there, by explicitly using FMA ops.

Could you please give the link?

See:

Also my thought on this.

Bristol Ridge has 2 128-bit FMAC pipes, so the throughput of 128-bit VFMADD132PD should be 2 per cycle, and 1 per cycle for the 256-bit form.
Zen has 4 128-bit pipes, so it should have the same throughput as Bristol Ridge in both cases.
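
The pipe arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not a measurement: the function name `fma_per_cycle` is made up for illustration, and it assumes that a 256-bit op is split into two 128-bit halves and that only two pipes can start an FMA each cycle on both designs (on Zen because the FMUL pipes borrow from the FADD pipes, as discussed earlier in the thread).

```python
def fma_per_cycle(fma_pipes: int, pipe_width_bits: int, op_width_bits: int) -> float:
    """Peak FMA instructions per cycle when wide ops are split across narrower pipes."""
    # A 256-bit op on 128-bit pipes occupies two pipe slots; a 128-bit op occupies one.
    slots_per_op = max(1, op_width_bits // pipe_width_bits)
    return fma_pipes / slots_per_op


# Assumption: 2 FMA-capable 128-bit pipes (Bristol Ridge, and effectively Zen too).
for width in (128, 256):
    print(f"{width}-bit VFMADD132PD: {fma_per_cycle(2, 128, width)} per cycle")
```

Under those assumptions the model gives 2 per cycle for the 128-bit form and 1 per cycle for the 256-bit form, which is why a listed figure of 2 points at a 128-bit measurement.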

That's not a matter of throughput but latency, which is in line (even higher than SR). Throughput could be gained by more efficient pipelining - allowing for some overlapping of FDIV execution, or simply adding more DIV units (or both of course).

If the throughput of the DIV is 1/latency, lowering latency also increases the throughput, because you wait less before issuing another instruction...
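
To put rough numbers on that relation, here is a small Python sketch. It assumes a divider that is not pipelined at all unless an overlap is given, and the names `div_reciprocal_throughput` and `overlap_cycles` are hypothetical; real dividers are more complicated and data-dependent.

```python
def div_reciprocal_throughput(latency_cycles: float,
                              units: int = 1,
                              overlap_cycles: float = 0.0) -> float:
    """Average cycles between completed divides (lower is better)."""
    # Cycles each unit stays busy per divide; overlap models partial pipelining.
    busy = latency_cycles - overlap_cycles
    if busy <= 0 or units < 1:
        raise ValueError("need busy time > 0 and at least one unit")
    return busy / units


print(div_reciprocal_throughput(13))            # unpipelined, one unit
print(div_reciprocal_throughput(13, units=2))   # two divide units
print(div_reciprocal_throughput(13, overlap_cycles=4))  # some overlap allowed
```

With one unpipelined unit the reciprocal throughput equals the latency (the 13 / 13 listing for DIVPD fits that pattern); a second unit or some overlap lowers it without touching latency.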
 

Abwx

Lifer
Apr 2, 2011
11,540
4,325
136
Best explanation for such configuration is that AMD used an ES with lower clock and OCed it to 3.4 GHz for purpose of that demo. How else do you really end up with such config? By having buggy/non-working turbo? I sure hope that is not the case.

While they seemed to have those 3.6/3.9 ES chips on CES, the New Horizon demo took place 3 weeks earlier. It is more than plausible that they used lower-clocked ES with overclock then.


The chip was barely consuming 74W during the demo, hardly what one would call an overclocked chip. Dunno where you are pulling such nonsensical theories from...

Btw, you are right that they did set something too high, though, namely the chip supply voltage; otherwise, instead of 74W it would have consumed something like 65-67W, as said, at 3.4GHz....
 

Jan Olšan

Senior member
Jan 12, 2017
400
689
136
The chip was barely consuming 74W during the demo, hardly what one would call an overclocked chip. Dunno where you are pulling such nonsensical theories from...

Btw, you are right that they did set something too high, though, namely the chip supply voltage; otherwise, instead of 74W it would have consumed something like 65-67W, as said, at 3.4GHz....

We have only seen idle-to-load deltas for both chips IIRC, so you can't really say what the "CPU consumption" was from that. What if Ryzen had high idle power draw because of early/ill-tuned BIOS and motherboard? I don't remember the specifics, but the reported idle power was relatively high, while you would normally expect the idle power of AM4 to be equally or more frugal than FM2+ (There was a discrete GPU, but those shouldn't have dramatic idle consumption today).

If it was a measurement of the load on CPU's 12V rail, it would be different.

And if my theory is nonsensical, why do you think they disabled turbo, then?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
We have only seen idle-to-load deltas for both chips IIRC, so you can't really say what the "CPU consumption" was from that. What if Ryzen had high idle power draw because of early/ill-tuned BIOS and motherboard? I don't remember the specifics, but the reported idle power was relatively high, while you would normally expect the idle power of AM4 to be equally or more frugal than FM2+ (There was a discrete GPU, but those shouldn't have dramatic idle consumption today).

If it was a measurement of the load on CPU's 12V rail, it would be different.

And if my theory is nonsensical, why do you think they disabled turbo, then

If the idle is, say, 20W, then once they fix the BIOS and other things we should expect even higher clocks (+400?), because a sane idle power is around a few watts...
 

Abwx

Lifer
Apr 2, 2011
11,540
4,325
136
We have only seen idle-to-load deltas for both chips IIRC, so you can't really say what the "CPU consumption" was from that.

Of course we can, and with a very small error margin...


What if Ryzen had high idle power draw because of early/ill-tuned BIOS and motherboard? I don't remember the specifics, but the reported idle power was relatively high, while you would normally expect the idle power of AM4 to be equally or more frugal than FM2+ (There was a discrete GPU, but those shouldn't have dramatic idle consumption today).

Because idle power included the monitors, it's as simple as that. These are full system consumptions which are measured, because one logically uses a PC with a monitor, I would think...


If it was a measurement of the load on CPU's 12V rail, it would be different.

The only difference is the PSU losses, which are about 10%; add the VRM losses, about 10% as well, since these are MBs with very good CPU supplies.

Edit: According to Hardware.fr, the 6900K platform uses 63W at idle; the CPU idle consumption is 21W measured at the 12V rail, which translates to roughly 18W for the CPU. This correlates with Zen, which is said to have idle power in the range of 5W, which would account for the 13W difference between the two platforms at idle.

And if my theory is nonsensical, why do you think they disabled turbo, then

Perhaps because the final firmware is still not implemented in the demo PCs, but for showing the perf it was better to have a fixed-frequency chip; otherwise the turbo would have made it impossible to pin down the perf window accurately.
 

Jan Olšan

Senior member
Jan 12, 2017
400
689
136
Because idle power included the monitors, it's as simple as that. These are full system consumptions which are measured, because one logically uses a PC with a monitor, I would think...

That would be the first time I have seen the LCD included in such a measurement. It is rather stupid, because it only clouds the measurement further, and it is not like the LCD's power cable is connected to the PC, as you could do in the 1980s.

And no, you certainly can't draw conclusions with "small error margin" when the difference in idle power with the same CPU on various motherboards of the same platform is routinely say 5-15 W. And in load, there is the varying efficiency of the VRMs, so the error margin is even bigger.
I'd say these numbers are so uncertain that the error margin may be bigger than the actual difference between Summit Ridge and Broadwell-E. And to calm you down, I don't want to spread doubts about Zen, it is just that I can't accept these numbers as conclusive for anything (good or bad).

Edit: According to Hardware.fr, the 6900K platform uses 63W at idle; the CPU idle consumption is 21W measured at the 12V rail, which translates to roughly 18W for the CPU. This correlates with Zen, which is said to have idle power in the range of 5W, which would account for the 13W difference between the two platforms at idle.
You can't take anecdotal data from one CPU on one board as universal truth about the CPU and the platform.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
We can estimate Zen's consumption from the idle/load delta, reduced by VRM and PSU efficiency, adding a sane idle power for a 14nm low-power chip... 94*.9*.9=76W, plus 5W, max 81W of total power consumption. And consider that VRM efficiency is under 88% (not 90%) and PSU efficiency is 90% only for 80 Plus Gold (or better) certified PSUs, and only at 50% load: if the PSU is e.g. 700W, to cope with a medium GPU, then with less than 200W of consumption the efficiency is well below 90%... Moreover, 5W is an overestimate of idle power. Usually it is 2-3W.
In conclusion, the power is probably in the ballpark of 75W, for an overvolted ES at 3.4GHz... I would not call that an overclocked chip...
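
For anyone who wants to replay that arithmetic, a minimal Python sketch follows. The function name `estimate_cpu_power` is made up for illustration; the first call uses the figures from the post (94W delta, 90% PSU, 90% VRM, 5W idle), while the second uses lower efficiencies and idle power that are purely hypothetical values chosen to show the sensitivity.

```python
def estimate_cpu_power(wall_delta_w: float, psu_eff: float,
                       vrm_eff: float, idle_w: float) -> float:
    """CPU power estimated from an idle-to-load delta measured at the wall."""
    # PSU and VRM losses are removed from the delta, then CPU idle power is added back.
    return wall_delta_w * psu_eff * vrm_eff + idle_w


# Figures from the post: this is the "max 81W" upper bound.
upper = estimate_cpu_power(94, 0.90, 0.90, 5)

# Hypothetical: 85% PSU, 88% VRM, 3W idle (illustrative values only).
lower = estimate_cpu_power(94, 0.85, 0.88, 3)

print(f"upper bound ~{upper:.1f} W, lower-efficiency case ~{lower:.1f} W")
```

The result moves by several watts with small changes in the assumed efficiencies, so the estimate is only as good as those inputs.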
 

KTE

Senior member
May 26, 2016
478
130
76
We can estimate Zen's consumption from the idle/load delta, reduced by VRM and PSU efficiency, adding a sane idle power for a 14nm low-power chip... 94*.9*.9=76W, plus 5W, max 81W of total power consumption. And consider that VRM efficiency is under 88% (not 90%) and PSU efficiency is 90% only for 80 Plus Gold (or better) certified PSUs, and only at 50% load: if the PSU is e.g. 700W, to cope with a medium GPU, then with less than 200W of consumption the efficiency is well below 90%... Moreover, 5W is an overestimate of idle power. Usually it is 2-3W.
In conclusion, the power is probably in the ballpark of 75W, for an overvolted ES at 3.4GHz... I would not call that an overclocked chip...
81W under 75% Max load... Nice estimate.

Real world vs guessing is an interesting study. We used this methodology to arrive at 90-100W for Agena back in the day. Then you hooked up the ammeter and saw no less than 128-140W. Good lesson.


Sent from HTC 10
(Opinions are own)
 

bjt2

Senior member
Sep 11, 2016
784
180
86
81W under 75% Max load... Nice estimate.

How can you say it's 75%? Link? Or is it a guess? And 81W is a theoretical limit. It's probably under 75W...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
81W under 75% Max load... Nice estimate.

Real world vs guessing is an interesting study. We used this methodology to arrive at 90-100W for Agena back in the day. Then you hooked up the ammeter and saw no less than 128-140W. Good lesson.



The ammeter on what? If it's the wall it's obvious... There are the VRMs and PSU losses...
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I am aware of that. But DIVPD is listed as 13 / 13. So 128 bit SSE2 division has throughput of 13 cycles. VDIVPD is listed as having 9 cycle TP??? Stilt has claimed 128 bit mode throughput 12.5, that is more of what I expect it to be ( equal or tiny bit better than DIVPD due to instruction decode differences ). So Intel has 4 cycle throughput, AMD has ~12 cycle throughput in 128bit mode.

In VDIVPD 256bit mode, Intel has 8 cycle throughput, AMD has 14.3 cycle TP ( according to Stilt ), which is very nice if true.

Still 256bit versus 128bit has strange differences (probably due to sensitivity of division operation speed to bits set in operands, it is usually quite variable and hard to test). If AMD has 2 divisor pipes, why they are not showing in 128 bit division throughput?
AIDA64 tests both special and generic divisors. To understand the causes for these differences we likely need to do our own microbenchmarking. So far there might also be a difference due to 2-operand and 3-operand encodings and the available FPRF regs during throughput measurement loops with such long latency instructions, as one would need to create many independent dependency chains.
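
As a rough illustration of why such a throughput loop needs many independent dependency chains: to keep a unit busy, there must be at least latency / reciprocal-throughput operations in flight, and each one needs its own destination register. The Python sketch below just computes that lower bound; the helper name `chains_needed` is hypothetical and the cycle counts are illustrative inputs, not verified Zen or Skylake figures.

```python
import math


def chains_needed(latency_cycles: float, reciprocal_throughput_cycles: float) -> int:
    """Minimum number of independent dependency chains needed to saturate a unit."""
    # Each chain can issue one op per `latency` cycles; the unit accepts one op
    # every `reciprocal_throughput` cycles, so the ratio (rounded up) is the bound.
    return math.ceil(latency_cycles / reciprocal_throughput_cycles)


# Illustrative: 13-cycle latency with a 4-cycle reciprocal throughput.
print(chains_needed(13, 4))

# Illustrative: a fully unpipelined unit (latency equals reciprocal throughput).
print(chains_needed(13, 13))
```

If the measurement loop uses fewer chains than this bound, or runs short of free registers, it ends up measuring latency rather than throughput.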

Could you please give the link?
Of course. Here's the translation link for the part about the FPU:
https://translate.google.com/transl...die-fliesskomma-einheit-im-detail/&edit-text=

It lists only one author (me), but a lot has been done by P3DNow! user/writer "Opteron" (S940).

Bristol Ridge has 2 128-bit FMAC pipes, so the throughput of 128-bit VFMADD132PD should be 2 per cycle, and 1 per cycle for the 256-bit form.
Zen has 4 128-bit pipes, so it should have the same throughput as Bristol Ridge in both cases.
It should. That's why it's a strange result.

If the throughput of the DIV is 1/latency, lowering latency increases also the throughput, because you should wait less to issue another instruction...
Yeah, that's true. Here I was focusing on JoeRambo's point.

BTW, we should not forget that average real-world FP software isn't stuffed with SQRT/DIV instructions, so latency becomes more important than throughput.
 

bjt2

Senior member
Sep 11, 2016
784
180
86

Thank you, I will read it ASAP (maybe not now... I must wake up in 7 hours and I am tired...)

BTW we should not forget that average FP real world software isn't stuffed with SQRT/DIV instructions, so that latency becomes more important than t'put.

This could be the reason for these huge SMT gains in Blender: while one thread waits on the FDIV, the other thread kicks in... SMT4 might not be a bad idea, especially for Zen, which has 4 FP pipes and a separate scheduler...
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,920
3,547
136
?
Bulldozer and Piledriver have 27 cycle latency for both 128 and 256-bit VDIVPD, while Steamroller has 33 cycles.
13/15 > 27/33?
See, according to me I'm the center of the universe, thus you must have been talking about what I was talking about. Thus I assumed VFMADD132PD. Out of interest, do you have the VFMADD132PD 256-bit latency/TP numbers? Same as CPC's?

cheers
 