New Zen microarchitecture details


superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Actually 3.6/4.0 and with 45W less of TDP, and it's still an ES (so subject to improvements)...

The answer to your question is FO4.

AMD with Excavator reached 4.3GHz turbo on 28nm bulk, with 4.9GHz when overclocked.

Only with Kaby Lake on 14nm has Intel bested it. Obviously the 28nm bulk process is not magical. It's Excavator (and probably also Zen) that has a low FO4...
Isn't Steamroller a better overclocker than Excavator, mainly because it's using a performance library rather than a density library? I believe Anandtech covered this in an article at one point. What I wonder about both, though, is why their DRAM latency is so high, according to The Stilt, in comparison with PD.
 
Reactions: Dresdenboy

DrMrLordX

Lifer
Apr 27, 2000
21,808
11,163
136
SR was only a better overclocker compared to Carrizo. Bristol Ridge has hit 4.8/4.9 GHz in some test runs on boards that you can't really buy, at least not in North America anyway (Asus' "Octopus" for example).

Kaveri/Godavari is consistently only good for 4.7 GHz, with rare exceptions.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Regardless, the FX-51 results are unsurprising.
That was the chip I was thinking of. I just didn't remember that it actually beat Intel's stuff in memory performance. How long did that lead last?
SR was only a better overclocker compared to Carrizo. Bristol Ridge has hit 4.8/4.9 GHz in some test runs on boards that you can't really buy, at least not in North America anyway (Asus' "Octopus" for example). Kaveri/Godavari is consistently only good for 4.7 GHz, with rare exceptions.
The Anandtech article is from well before BR.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
That was the chip I was thinking of. I just didn't remember that it actually beat Intel's stuff in memory performance. How long did that lead last?

The Anandtech article is from well before BR.
Until Nehalem, but they rolled with large L2 caches on the C2Ds, which helped a lot.
 

StrangerGuy

Diamond Member
May 9, 2004
8,443
124
106
Until Nehalem, but they rolled with large L2 caches on the C2Ds, which helped a lot.

Plus the lack of OoO loads, which was the Achilles' heel of the K8, and whatever latency improvement the IMC brought over C2D was more than compensated for by the C2D prefetchers.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
Plus the lack of OoO loads, which was the Achilles' heel of the K8, and whatever latency improvement the IMC brought over C2D was more than compensated for by the C2D prefetchers.

There were plenty of limitations on STARS (the entire integer execution section, for example). The great irony is that pretty much all of those were fixed in Bulldozer to a level around Sandy Bridge. They just then went and messed everything else up...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Who said that I was talking about VecInt, apart from you?

One FPMUL in port 0 and one (FPMUL or a FPADD) in the port 1.

At the bottom of the page :

http://www.hardware.fr/news/29-08-2016/



It boosts to 3.5 on all cores in just about any condition, including Prime95 if the FFT is not too small:

http://www.hardware.fr/articles/946-4/overclocking-consommation.html
OK, I checked the Intel manual and it seems that each pipe can do mul, add and FMAC at 256 bit. But there are only 2 such pipes. On 128-bit/legacy code AMD has advantages.
And on 256-bit code, only 2 pipelines remain for other instructions, shared by 2 threads (mov, cmp, branch, inc/dec for loops, lea for addresses, etc.). Zen, instead, has separate int and fp pipes and can even do 4 int PLUS 4 fp...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Interestingly Canard PC measured SR's VFMADD132PD throughput to be 2x BR's throughput, or the same as KL (2/cycle).
SQRTPD gets executed on just 1 unit, DIVPD, too, while VDIVPD has a bit better throughput.

Edit:
I just found out about two new articles by Hiroshige Goto (auto translated to Googlish):
The integer unit of AMD's next generation CPU "ZEN" is completely different from the Bulldozer series
Floating point / SIMD unit of AMD next generation CPU "ZEN" of relatively mature design

I read the two Goto articles... The second one doesn't say whether the four pipelines can do 4 FADDs. I thought that they were 2 FMUL and 2 FADD, to be combined for FMAC. But it seems, according to Goto, that the FMUL pipe is actually a full-blown FMAC and that it only borrows the bus from the FADD pipe, blocking the FADD at the same time. I wonder if the FMUL pipe can also do an FADD, since it is actually an FMAC pipe...

Anyway, I found that VFMADD132PD can be either 128 or 256 bit, so I think the measured throughput was 2 because it was the 128-bit form...
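
To make the 128-bit versus 256-bit point concrete, here is a rough sketch in Python. It assumes a deliberately simplified model (not anything AMD has published): an operation wider than the execution unit is split into several uops, and every pipe issues one uop per cycle. The function name and parameters are made up for illustration.

```python
import math

def instructions_per_cycle(op_bits: int, unit_bits: int, pipes: int) -> float:
    """Simplified model: an op wider than the unit splits into several uops,
    and each pipe retires one uop per cycle."""
    uops = math.ceil(op_bits / unit_bits)
    return pipes / uops

# Two 128-bit FMA-capable pipes (assumed figure, taken from the discussion).
print(instructions_per_cycle(128, 128, 2))  # 128-bit form of the instruction
print(instructions_per_cycle(256, 128, 2))  # 256-bit form, split into 2 uops
```

Under this model the 128-bit form reaches 2 per cycle and the 256-bit form only 1 per cycle, which is why it matters which form was actually measured.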
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Throughput of DIV and SQRT is worse on AMD than on Kaby Lake, especially with SSE, as you can see in the picture linked in Dresdenboy's post.

Strangely, in AVX2 mode VDIVPD on Zen is almost as fast as on Kaby Lake. So in AVX2 they are almost as fast, while in SSE mode DIV and SQRT are slower, but Zen has twice the pipelines... Indeed, with standard Blender (which is 128 bit, probably SSE2), Zen has similar IPC to BWE...
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Interestingly Canard PC measured SR's VFMADD132PD throughput to be 2x BR's throughput, or the same as KL (2/cycle).
SQRTPD gets executed on just 1 unit, DIVPD, too, while VDIVPD has a bit better throughput.

VFMADD132PD is probably operating on two doubles instead of four, so throughput is 2.
Intel's dividers are the stuff of legend; I think by Broadwell they had 10-bit-wide dividers, so their throughput is like 2-3x of what AMD has. Again, the VDIVPD results from Canard make zero sense for Zen; probably two doubles instead of four are processed here (or the DIVPD throughput is reported too low).
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
128-bit, same instruction as CPC listed.

Sad, because they listed 256-bit VDIVPD throughput for Skylake, which makes it a somewhat worthless comparison.

EDIT: kind of strange when industry veterans like Canard PC make such mistakes in an instruction table. Both FMA and VDIVPD somehow list 128-bit ops for AMD and 256-bit ops lat/TP for Intel. A strange way to paint a rosier picture than the real one.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
Sad, because they listed 256-bit VDIVPD throughput for Skylake, which makes it a somewhat worthless comparison.

EDIT: kind of strange when industry veterans like Canard PC make such mistakes in an instruction table. Both FMA and VDIVPD somehow list 128-bit ops for AMD and 256-bit ops lat/TP for Intel. A strange way to paint a rosier picture than the real one.

How many AVX ops do there have to be in an instruction mix to fire up the 256-bit paths on Haswell? Depending on the tests and workload, those values could be valid.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
How many AVX ops do there have to be in an instruction mix to fire up the 256-bit paths on Haswell? Depending on the tests and workload, those values could be valid.

They are correctly listed as 8-cycle TP for 256 bit; if they were 128 bit, their throughput would be 4 cycles, so your theory does not apply. If the 256-bit path was not active, throughput would be even lower. It's the Zen results that don't make sense for 256 bit versus 128 bit; clearly both ops are best-case 128 bit.
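
A quick way to compare table entries measured at different vector widths is to normalize them to doubles per cycle. The Python sketch below does that for the 8-cycle/256-bit and 4-cycle/128-bit figures mentioned above; the function name is hypothetical, and 64-bit double-precision elements are assumed.

```python
def elements_per_cycle(vector_bits: int, recip_throughput_cycles: float,
                       element_bits: int = 64) -> float:
    """Convert a reciprocal throughput (cycles per instruction) into
    elements completed per cycle, so different widths can be compared."""
    lanes = vector_bits // element_bits
    return lanes / recip_throughput_cycles

print(elements_per_cycle(256, 8))  # 4 doubles every 8 cycles
print(elements_per_cycle(128, 4))  # 2 doubles every 4 cycles
```

Both cases work out to the same 0.5 doubles per cycle, so comparing a 128-bit entry directly against a 256-bit entry overstates the 128-bit side by 2x.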
 

bjt2

Senior member
Sep 11, 2016
784
180
86
VFMADD132PD is probably operating on two doubles instead of four, so throughput is 2.
Intel's dividers are the stuff of legend; I think by Broadwell they had 10-bit-wide dividers, so their throughput is like 2-3x of what AMD has. Again, the VDIVPD results from Canard make zero sense for Zen; probably two doubles instead of four are processed here (or the DIVPD throughput is reported too low).

AFAIK Intel archs have 256-bit units only for FADD, FMUL and FMAC. Other FP ops are limited to 128 bits...
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
They are correctly listed as 8-cycle TP for 256 bit; if they were 128 bit, their throughput would be 4 cycles, so your theory does not apply. If the 256-bit path was not active, throughput would be even lower. It's the Zen results that don't make sense for 256 bit versus 128 bit; clearly both ops are best-case 128 bit.

If you look at VFMADD132PD for Bristol Ridge, it matches Agner for latency and reciprocal throughput. It's also listed as 2 uops, so it's a 256-bit op. If you look at Summit Ridge, it has twice the throughput of Bristol Ridge, which doesn't make sense because they both have two FMA "units". So I went back and looked at/listened to the Hot Chips presentation, and nowhere does it list or say that the pipes are 128 bit or 256 bit, only that it has 128-bit stores. I have had a quick look and I can't find anywhere AMD says the SIMD pipes are 128 bit.

I don't think it's likely that CPC did the Bristol Ridge and Kaby Lake tests right but Summit Ridge wrong. Maybe they do have 256-bit units but are store limited, so unless your tests keep lots of data in registers, you only see the throughput improvement on long-latency/low-throughput instructions?

edit: it would also make sense from a "SenseMI" perspective; I have wondered about some of the comments AMD reps have made around it. They would only make sense if you clock/power gated resources.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
AFAIK Intel archs have 256-bit units only for FADD, FMUL and FMAC. Other FP ops are limited to 128 bits...

And? It is obvious from the DIV results that twice as many bits halves the throughput. The Intel results are correct for 256 bit and for Skylake.
http://instlatx64.atw.hu/

What is obviously wrong: the AMD Zen results. 128-bit results are presented and compared with 256-bit results from Skylake.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
If you look at Summit Ridge, it has twice the throughput of Bristol Ridge, which doesn't make sense because they both have two FMA "units". So I went back and looked at/listened to the Hot Chips presentation, and nowhere does it list or say that the pipes are 128 bit or 256 bit, only that it has 128-bit stores. I have had a quick look and I can't find anywhere AMD says the SIMD pipes are 128 bit.

They say it in their slides: "2 Floating point units x 128 FMACs." Unless "128" in their parlance means 256 bits, they have 128-bit FMACs.
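
For a sense of scale, the slide wording can be turned into a peak double-precision FLOPs-per-cycle figure. This is only back-of-the-envelope arithmetic in Python, assuming an FMA counts as 2 FLOPs per lane and that every unit issues one FMA per cycle; the function name is hypothetical, and the 2 x 256-bit case is the two-pipe Intel configuration described earlier in the thread.

```python
def peak_dp_flops_per_cycle(units: int, unit_bits: int,
                            flops_per_lane: int = 2) -> int:
    """Peak double-precision FLOPs per cycle, assuming every unit issues
    one FMA per cycle and an FMA counts as 2 FLOPs per 64-bit lane."""
    lanes = unit_bits // 64
    return units * lanes * flops_per_lane

print(peak_dp_flops_per_cycle(2, 128))  # 2 x 128-bit FMAC, as on the slide
print(peak_dp_flops_per_cycle(2, 256))  # 2 x 256-bit FMA pipes
```

That gives 8 versus 16 DP FLOPs per cycle per core, which is the gap you would expect to see on 256-bit FMA code if the units really are 128 bit.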
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Sad, because they listed 256-bit VDIVPD throughput for Skylake, which makes it a somewhat worthless comparison.

EDIT: kind of strange when industry veterans like Canard PC make such mistakes in an instruction table. Both FMA and VDIVPD somehow list 128-bit ops for AMD and 256-bit ops lat/TP for Intel. A strange way to paint a rosier picture than the real one.

So, the bottom line is that Skylake has 2x the listed throughput for the two instructions mentioned (and that is approx. 2x Zen's throughput)?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
If you look at VFMADD132PD for Bristol Ridge, it matches Agner for latency and reciprocal throughput. It's also listed as 2 uops, so it's a 256-bit op. If you look at Summit Ridge, it has twice the throughput of Bristol Ridge, which doesn't make sense because they both have two FMA "units". So I went back and looked at/listened to the Hot Chips presentation, and nowhere does it list or say that the pipes are 128 bit or 256 bit, only that it has 128-bit stores. I have had a quick look and I can't find anywhere AMD says the SIMD pipes are 128 bit.

I don't think it's likely that CPC did the Bristol Ridge and Kaby Lake tests right but Summit Ridge wrong. Maybe they do have 256-bit units but are store limited, so unless your tests keep lots of data in registers, you only see the throughput improvement on long-latency/low-throughput instructions?

edit: it would also make sense from a "SenseMI" perspective; I have wondered about some of the comments AMD reps have made around it. They would only make sense if you clock/power gated resources.

I found that the VFMADD132PD opcode applies to both the 128-bit and 256-bit versions. So the throughput may refer to the 128-bit version...
 