New Zen microarchitecture details


KTE

Senior member
May 26, 2016
478
130
76
I won't waste my time; an ARM core certainly has a pipeline that is less frequency-friendly than an Intel CPU's, doesn't it?



http://forums.anandtech.com/showpost.php?p=38251747&postcount=1484

This was posted numerous times. In case you need more ground for estimates: LVT/RVT are not the lowest-Vth transistors within 14nm LPP; there's still sLVT, which should leak more but switch faster.

The numbers for LVT :

http://semiaccurate.com/forums/showpost.php?p=250960&postcount=47
I don't mean to sound rude but I think you are mistaken in your understanding of semiconductor test curves, their application and their limitations.

But still, thanks for the links (I have read them previously). Do note that the switching power is also 2x in comparison there, at far below Tjmax.

Your current projection is a top-end 8C, wide, complex chip with SMT at near 4GHz at less than 10W each core max? That's what I was asking more clarification about.

Sent from HTC 10
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I will say again: unless the ZEN core is way smaller and more efficient than Excavator at the same node, having less MT performance than an Excavator module will be a fail for the server (Cloud/Data Center) market.

Also, if we take 40% IPC increase + 25% from SMT, at the same clocks ZEN would have close to 80% of MT performance of an Excavator Module.

So I'm thinking that ZEN SMT scaling may be better than what we are expecting, or clocks will be higher than on current 28nm products. Perhaps it could be both?
There's only so much you can get out of SMT on such a core. If there are enough execution units for, say, an integer and an FP thread, then the processor will hit the next bottlenecks waiting in line, like 2 AGUs, fetch B/W, decode B/W, cache ports, scheduler queue, etc. On SNB, IVB, HSW, BDW and SKL, each thread's performance is more like 60-70% of ST performance when the core is running SMT at full steam.

Instead of adding that much logic to make SMT great again, they spent the transistors on cores.

I think, the Zen core vs. XV module calculation would be more like this:
1 Zen core = 1 XV core * 1.4 * 1.25 = 1.75 XV cores
or roughly 1 module with a rather well scaling application.
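
The multiplication above can be written out as a minimal sketch. The two factors are the forum estimates quoted in the post (40% IPC uplift, 25% SMT gain), not measured values, and the function and constant names are made up for illustration.

```python
# Figures quoted in the post above; both are forum estimates, not measurements.
IPC_UPLIFT = 1.40  # Zen IPC relative to one XV core ("40% more IPC")
SMT_UPLIFT = 1.25  # assumed throughput gain from running a second SMT thread


def zen_core_in_xv_cores(ipc_uplift: float = IPC_UPLIFT,
                         smt_uplift: float = SMT_UPLIFT) -> float:
    """Throughput of one Zen core (two threads) in units of one XV core, at equal clocks."""
    return ipc_uplift * smt_uplift


if __name__ == "__main__":
    print(f"1 Zen core = {zen_core_in_xv_cores():.2f} XV cores")
```

Changing either factor shows how sensitive the "roughly one module" conclusion is to the assumed SMT gain.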
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,425
8,388
126
I'm not just referring to this forum, but the interweb at large

i haven't seen espn report about it at all so i don't think the interweb at large cares much about it


i really don't go to other tech boards, so this is my only frame of reference. i don't expect other boards to be vastly different from this one.


that said, i have an idea....
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I don't have any >= Haswell chips with SMT enabled, so could someone please test the following:

Download: https://onedrive.live.com/redir?resid=8329B08E8413A80E!553&authkey=!AIU4D0mRdco08M8&ithint=file%2czip

Extract & run (test.bat).

The test will run three times, write down the "IPS" figure from each run (from the window which remains). Proceed to the next phase by pressing any key, after the second window has shut down (second and third run) and the main window says "Done".

This is a standard single-threaded FP test, based on the Euler CFD program from the Rodinia test suite. I would like to see how SMT behaves on more recent Intel designs.

It takes less than five minutes to complete, and the only requirement is that the clocks stay static and there is no excess stuff running on threads 0 & 1.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,868
136
I don't mean to sound rude but I think you are mistaken in your understanding of semiconductor test curves, their application and their limitations.

But still, thanks for the links (I have read them previously). Do note that the switching power is also 2x in comparison there, at far below Tjmax.

Your current projection is a top-end 8C, wide, complex chip with SMT at near 4GHz at less than 10W each core max? That's what I was asking more clarification about.

Sent from HTC 10

I don't think I'm mistaken at all; quite the contrary, I think you didn't catch the meaning of those numbers.

Switching power is the same because that's a test at iso-power; you will notice that the 14nm LPP frequency is considerably higher.

Same power but 2.06x the frequency (of 28nm HPP); at the same frequency, power would be 3.54x smaller. That's for LVT devices.
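
One way to read the two LVT figures together, as a rough sketch only: on the 14nm LPP curve, going from the 28nm reference frequency to 2.06x that frequency raises power by 3.54x. If power is assumed to follow a simple power law P ∝ f^k over that range (an assumption made here, not something stated in the linked posts), the exponent falls out directly. The names below are illustrative.

```python
import math

# LVT figures quoted in the post above (14nm LPP vs. 28nm HPP).
FREQ_GAIN_AT_ISO_POWER = 2.06       # frequency ratio at the same power
POWER_REDUCTION_AT_ISO_FREQ = 3.54  # power ratio at the same frequency


def implied_power_exponent(freq_gain: float, power_reduction: float) -> float:
    """Exponent k in P ~ f**k implied by the two ratios.

    Assumes a single power law holds between the two operating points,
    which is a simplification made for this sketch.
    """
    return math.log(power_reduction) / math.log(freq_gain)


if __name__ == "__main__":
    k = implied_power_exponent(FREQ_GAIN_AT_ISO_POWER, POWER_REDUCTION_AT_ISO_FREQ)
    print(f"implied exponent k = {k:.2f}")
```

An exponent between 1 and 3 is what one would expect when voltage has to rise along with frequency, but the real curve is not a clean power law, so treat the result as a sanity check only.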
 

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,167
136
I'm sad that you set your expectations to unreasonably high levels

Why is it unreasonable? See AtenRa below.

Isn't it more that lads are saying "I want it to be faster than Intel'sBrightestAndBestestK CPU", rather than hyping it up that it's going to be around that level o' performance?

No, it's about AMD eclipsing their own old designs. It's like 6c Thuban beating 3m Piledriver in . . . well . . . anything. Which, ya know, happened. Sadly.

I will say again: unless the ZEN core is way smaller and more efficient than Excavator at the same node, having less MT performance than an Excavator module will be a fail for the server (Cloud/Data Center) market.

Ding ding ding, winnar. Especially when you consider the fact that XV was on shelves last year.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Yeah, the latest Intel chips have usually consumed less power at full (non-AVX) load than what their TDPs are rated at, but AVX2 is really a power hog.

The quad-channel RAM bus can't help TDP either. And it's only needed for the really big server chips; 8-core and below can usually do fine on dual-channel. Zen (at least the Summit Ridge incarnation) will be dual-channel, so that should help improve TDP compared to Intel's offerings a bit.
 

KTE

Senior member
May 26, 2016
478
130
76
I think, the Zen core vs. XV module calculation would be more like this:
1 Zen core = 1 XV core * 1.4 * 1.25 = 1.75 XV cores
or roughly 1 module with a rather well scaling application.
1.75x XV

Is that the max or min performance you're expecting?

Also how does that factor in a) the contention hit of SMT with 2 or more cores used b) possible LLC bottlenecks in MT?

I don't have any >= Haswell chips with SMT enabled, so could someone please test the following:
I have IVB and SKL (and more) laptops but no access till Monday.

I don't think I'm mistaken at all; quite the contrary, I think you didn't catch the meaning of those numbers.

Switching power is the same because that's a test at iso-power; you will notice that the 14nm LPP frequency is considerably higher.

Same power but 2.06x the frequency (of 28nm HPP); at the same frequency, power would be 3.54x smaller. That's for LVT devices.

So how do these curves apply to Zen?

Sent from HTC 10
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
1.75x XV

Is that the max or min performance you're expecting?

Also how does that factor in a) the contention hit of SMT with 2 or more cores used b) possible LLC bottlenecks in MT?


I have IVB and SKL (and more) laptops but no access till Monday.



So how do these curves apply to Zen?

Sent from HTC 10

Also keep in mind that an XV module does not scale as well as a traditional dual core with 2 threads (more like 1.7x scaling with the second core).
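
Putting the 1.7x module scaling next to the earlier 1.4 * 1.25 estimate gives a quick core-vs-module ratio. This is a sketch built only on the numbers posted in this thread; the names are hypothetical and both inputs are estimates.

```python
# Estimates taken from posts in this thread, not from measurements.
ZEN_CORE_IN_XV_CORES = 1.4 * 1.25  # one Zen core with SMT, in units of one XV core
XV_MODULE_IN_XV_CORES = 1.7        # one XV module (2 cores, CMT), in units of one XV core


def zen_core_vs_xv_module(zen: float = ZEN_CORE_IN_XV_CORES,
                          module: float = XV_MODULE_IN_XV_CORES) -> float:
    """Throughput of one Zen core (2 threads) relative to one XV module (2 threads)."""
    return zen / module


if __name__ == "__main__":
    print(f"Zen core / XV module = {zen_core_vs_xv_module():.2f}")
```

With these inputs the ratio lands slightly above 1, which matches the "roughly 1 module" wording, and it drops below 1 as soon as the assumed SMT gain falls under about 21%.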
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
1.75x XV

Is that the max or min performance you're expecting?

Also how does that factor in a) the contention hit of SMT with 2 or more cores used b) possible LLC bottlenecks in MT?
This was just a theoretical comparison of Zen+SMT vs. a single XV core at iso-frequency. The result means roughly equal IPC with SMT/CMT. I think Zen clocks will be lower than BR's, but so will power per core. So in this core vs. module comparison I'd say this is closer to the maximum average performance to expect. YMMV of course with every workload (some might really shine because XV bottlenecks are missing in Zen).

XV had no L3, and it had worse prefetchers and a slower L2 than Zen. As seen in core scaling tests with 2Ch SKL (with L3), there isn't too much of a hit caused by mem B/W. Power limits are more interesting here.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Fottemberg seems to be posting this image all over the place:



If his point is showing that even early prototypes of Bulldozer didn't hit very high frequencies, someone could tell him that's not the case.

Prototype Model 2820 - 3.6GHz - Rev. A1
Prototype Model 3120 - 4.1GHz - Rev. B0

Bulldozer Rev. B2G ended up shipping at 4.3GHz (FX-4170).

Rev. A1 was the first revision which was complete in terms of features.
The part in the CPU-Z screenshot is Rev. B0 (44 ending).

As far as I know, at silicon level Zeppelin has been already finalized. Because of that I find those rumors about "samples operating at just 2.4GHz" pretty unlikely.
 

stuff_me_good

Senior member
Nov 2, 2013
206
35
91
I don't have any >= Haswell chips with SMT enabled, so could someone please test the following:

Download: https://onedrive.live.com/redir?resid=8329B08E8413A80E!553&authkey=!AIU4D0mRdco08M8&ithint=file%2czip

Extract & run (test.bat).

The test will run three times, write down the "IPS" figure from each run (from the window which remains). Proceed to the next phase by pressing any key, after the second window has shut down (second and third run) and the main window says "Done".

This is a standard single-threaded FP test, based on the Euler CFD program from the Rodinia test suite. I would like to see how SMT behaves on more recent Intel designs.

It takes less than five minutes to complete, and the only requirement is that the clocks stay static and there is no excess stuff running on threads 0 & 1.
Hope this helps.


 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Thanks for testing!

The results look quite different than I expected.

The first run tests the performance with only thread 0 (native core) utilized. The second run utilizes both threads 0 and 1 (i.e. native core and SMT thread). The third run does exactly the same, except that the SMT thread is utilized twenty seconds after the native core starts executing.

I wanted to see how much fully utilizing a single SMT-capable core penalizes the performance, and also whether there is a "switching penalty" of any kind. Based on those numbers there is no SMT switching penalty, but the SMT penalty would be 46%.

I don't think that can be right. Isn't SMT just supposed to utilize the resources which are unused, i.e. the native core has priority over resources?
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Thanks for testing!

The results look quite different than I expected.

The first run tests the performance with only thread 0 (native core) utilized. The second run utilizes both threads 0 and 1 (i.e. native core and SMT thread). The third run does exactly the same, except that the SMT thread is utilized twenty seconds after the native core starts executing.

I wanted to see how much fully utilizing a single SMT-capable core penalizes the performance, and also whether there is a "switching penalty" of any kind. Based on those numbers there is no SMT switching penalty, but the SMT penalty would be 46%.

I don't think that can be right. Isn't SMT just supposed to utilize the resources which are unused, i.e. the native core has priority over resources?

No, the two threads share the execution resources equally (assuming no pipeline stalls for either thread), so the throughput of a single thread drops when a second thread is scheduled for execution. Overall core utilization and throughput increase.
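
To make the point concrete, here is a small sketch that turns a per-thread SMT penalty into aggregate core throughput. It assumes both threads slow down by the same fraction, as in the equal-sharing description above; the 46% figure is the one reported earlier in the thread, and the function name is illustrative.

```python
def smt_aggregate_throughput(per_thread_penalty: float, threads: int = 2) -> float:
    """Total core throughput relative to a single thread running alone.

    per_thread_penalty is the fraction of solo performance each thread loses
    when its sibling is active (0.46 means each thread runs at 54%).
    Assumes every thread is penalized equally.
    """
    per_thread_speed = 1.0 - per_thread_penalty
    return threads * per_thread_speed


if __name__ == "__main__":
    total = smt_aggregate_throughput(0.46)
    print(f"aggregate throughput = {total:.2f}x a single thread")
```

So a 46% per-thread drop is still a net gain of about 8% for the core as a whole under this assumption, which is small but consistent with an FP-heavy test that already keeps the execution units busy.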
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
So the SMT thread should perform the same as the native core, as long as the native core is not utilized?
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Source please? IPC was always shown vs EX core (and core vs core not 1C/2T vs 1M/2T). EX core is around 15% faster than PD core on average, bar some +/- corner cases.
So 40% over EX core is around 60% over PD core. Then there is SMT bonus in MT workloads where PD module has 15-20% penalty.

At Computex, and seemingly in a recent investor slide, they've been comparing Zen to what it is replacing which is FX Vishera. Imo, so they can make less nuanced ("Up to", "common workloads") IPC claims like their unambiguous "40% more IPC" Computex slide.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,868
136
At Computex, and seemingly in a recent investor slide, they've been comparing Zen to what it is replacing which is FX Vishera. Imo, so they can make less nuanced ("Up to", "common workloads") IPC claims like their unambiguous "40% more IPC" Computex slide.

Comparison with the FX is for this slide, so SKU vs SKU :



And comparison with XV, so core vs core :


 

KTE

Senior member
May 26, 2016
478
130
76
So the SMT thread should perform the same as the native core, as long as the native core is not utilized?
It is simply a method of running two logical threads on one core with minimal die size or power increase. It is a method to improve EU utilization hence IPC.
Nehalem SMT: http://www.anandtech.com/show/2594/8

If there are resources free and there aren't resource conflicts or stalls in execution for other reasons, the overall performance of the core rises above 100% of single-thread performance. The second thread may add anything up to just under 35%: 27% was the maximum Intel reported the first time around in 2002, and I think 34% the second. So it's two logical threads managed through a dynamic resource-sharing scheme.

AMD CPUs run at much lower IPC in general, so the assumption is that there is a bigger gain possible if it is done well and efficiently. But that's a huge IF (net enthusiasts boasted a lot more about CMT, but look what happened...).

Sent from HTC 10
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Page 13 of the current version of the investor presentation says "Over 40% improvement in IPC over current AMD CPU core", just like the rest of the slides do (core). Current generation cannot mean anything but the 15h family, and all 15h family designs are compute unit based. Since there are 1-4 CUs in the designs which AMD calls 2-8 core products, there is exactly zero room left for interpretation.
 

KTE

Senior member
May 26, 2016
478
130
76
I wanted to see how much fully utilizing a single SMT-capable core penalizes the performance, and also whether there is a "switching penalty" of any kind. Based on those numbers there is no SMT switching penalty, but the SMT penalty would be 46%.
A penalty occurs with some code, depending on its instruction mix, but one that big only with code such as loops that avoid the tiered cache/memory and are designed for absolute throughput measurements.

Edit: Westmere-EP was studied with professional FP workloads: https://www.nas.nasa.gov/assets/pdf/papers/saini_s_impact_hyper_threading_2011.pdf



SMT gain was approx. -7% to 15% for 24 cores, more with higher core counts. The negative results correlated strongly to the amount of vectorization in the code.

Sent from HTC 10
 

Abwx

Lifer
Apr 2, 2011
11,172
3,868
136
Page 13 of the current version of the investor presentation says "Over 40% improvement in IPC over current AMD CPU core", just like the rest of the slides do (core). Current generation cannot mean anything but the 15h family, and all 15h family designs are compute unit based. Since there are 1-4 CUs in the designs which AMD calls 2-8 core products, there is exactly zero room left for interpretation.

Lol, so Piledriver is the "current" AMD core, including APUs..

Current AMD core means the current design, not their 2012 design...

In case you missed the post :

Comparison with the FX is for this slide, so SKU vs SKU :



And comparison with XV, so core vs core :


 

naukkis

Senior member
Jun 5, 2002
782
637
136
So the SMT thread should perform the same as the native core, as long as the native core is not utilized?

Of course. Symmetrical threading. Instruction fetch comes from a different thread every other cycle; two threads combined will use more of the resources than a single one, so overall throughput will rise.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Lol, so Piledriver is the "current" AMD core, including APUs..

Current AMD core means the current design, not their 2012 design...

In case you missed the post :

It is, as it is still manufactured and shipped. Besides, for the server segment (which Zen is primarily intended for) Piledriver is the most recent core.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,868
136
It is, as it is still manufactured and shipped. Besides, for the server segment (which Zen is primarily intended for) Piledriver is the most recent core.

The slide that mentions Zen and the 40% says Excavator; notice that the comparison is explicitly about Zen, which is a core:





On the other slide Summit Ridge is mentioned, and that's in comparison to the Vishera-based FX:



So the first slide is for IPC in comparison to XV, while the second one compares whole CPUs throughput-wise.

It's logical that FX is used as the basis for the platform comparison, since it's the highest-throughput SKU in AMD's current lineup, and we're talking about high-performance DT parts after all...
 

KTE

Senior member
May 26, 2016
478
130
76
Lol, so Piledriver is the "current" AMD core, including APUs..

Current AMD core means the current design, not their 2012 design...
Current core or previous generation means current FX/top-end offering and what the new generation is replacing. Which is PD.

PR talk is segmented carefully. Years don't matter.

"Barcelona is 40% faster, 42%, then 50% faster than Intel's best". Sadly, what all the hardcore enthusiasts found out after launch was that AMD PR meant Barcelona at 2.6GHz versus Clovertown at 2.66GHz... which had 3 faster models, and now Harpertown, when Barcelona came to be available at 2.0GHz. Those were some sad days...

Sent from HTC 10
 