New Zen microarchitecture details


Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
What data?
What does the slide below say? If you don't understand what is mentioned there, then I wonder why you are even trying to talk about CPUs...



Where does it show that Intel chips are undervolted?

At the same site where the slide originates: with 7% undervolting all you could do was run Windows, but certainly not Prime or IBT. Same at Hardware.fr, where they didn't even bother to do their usual undervolting test. I guess it will be the same as the previous gen, that is, MBs "unexpectedly overvolting" the CPUs...

40% wouldn't even put it close to Broadwell either if you compare with Excavator.

It depends on what you are comparing: in integer-based code it's more than enough, while in FP, as pointed out by their rendering demo, it's apparently, and logically, significantly more than 40%.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Any advances on this, especially the last sentence?
It appeared at Linley Group, but behind a paywall.
http://www.linleygroup.com/index.php


40% on average for ST could never be less than awesome. We're accustomed to 5-20%.

Poses the question how this 40% is impacted with MT accesses in a DT/Mb soft. that scales well to 8 cores.

I mean are we potentially looking at scenarios of P=EXC 1C+20-30%+SMT?

The converse side is, if SMT is done well, we could be looking at superlinear scaling for servers with the bigger L1/L2 in situations where the data starts fitting into the caches (esp scientific).

Also CMT != SMT for performance comparisons. 8 BD cores will be compared to these 8 full cores. That's how they are marketed. SMT is an additional extraction from the same cores, rather than a core duplication, so it will be included.
Well, AMD's last gen wasn't SNB, but a BD derivative, which barely has an IPC > that of K10, with a L3-like L2 latency, 2.x-wide integer, still shared fetch, etc.

Speaking of that, I think that with BD AMD actually targeted Intel's SMT scenario. They argued about per-thread performance, which on Intel is ~60-70% of ST performance. But ST somehow kept being important.

1C XV +40% +25% for SMT roughly equals 1 XV module. At ~5-6mm².

Anyway for claims regarding IPC, SMT yield, clocks, individual benchmarks: YMMV.
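
To put a number on the "1C XV +40% +25% for SMT" estimate above, here is a minimal sketch of the arithmetic. It assumes the two gains simply multiply, which is a simplification; the function name and default values are only illustrative, taken from the figures quoted in this thread.

```python
def smt_core_vs_xv_core(ipc_gain=0.40, smt_yield=0.25):
    """Throughput of one SMT core running two threads, relative to a
    single XV core running one thread (= 1.0).

    Assumption: the IPC gain and the SMT yield compose multiplicatively.
    """
    return (1.0 + ipc_gain) * (1.0 + smt_yield)


if __name__ == "__main__":
    print(f"{smt_core_vs_xv_core():.2f}x one XV core")
```

Under those assumptions the result is 1.75x a single XV core, so the "roughly equals 1 XV module" claim holds only if a fully loaded two-thread XV module also lands near that figure. Clocks and per-benchmark SMT yield can move this in either direction.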

1). 40% ST IPC over XV isn't that great. All the CMT designs have poor ST IPC and make up for it with the second thread. Switching to an SMT design *similar to Intel's* should have changed the power balance so that more of the Zen core's resources could be committed to the "first thread" where necessary. If Keller went with something closer to POWER8's SMT design then maybe we're just seeing the next evolution of Con cores.
Someone said, that XV IPC is above K10 levels now. So we're seeing K10+40%+x.

Does Intel now have a first thread prioritization? According to my tests on BDW with Prime95 + another MT software at different priorities, there is no such thing. This is thread communism. Each thread gets the same resources + what it grabs by accident.

IBM didn't come with SMT-8 as their first SMT implementation. Bigger SMT designs also come with a cost (area, power, complexity, etc.). If we had said that AMD should just take its time, we'd have to wait until 2018-20.
 
Last edited:

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
Polaris turned up in a competitive product. If someone expected a GP100-crushing chip at $249, that's their issue. Also, the jump to 14nm had a very positive effect on power if you compare with their previous architecture at 28nm (and there the process was the same as Nvidia's, so it's clear that the current GCN chips are not as easily power gated as the competitor's, and that it's not the process's fault).

Polaris failed miserably in terms of perf/watt against Pascal. That is very clear from the reviews.. AMD lied that Polaris had 2x the perf/watt of Maxwell when the reality was it struggled to match Maxwell perf/watt. AMD has shown they wilfully misrepresent the facts and competitiveness of their upcoming products. I would advise people to not have any expectations based on AMD marketing claims. In fact I would say the opposite of their claims is likely to be true.
 
Reactions: Phynaz

Glo.

Diamond Member
Apr 25, 2015
5,763
4,667
136
Polaris failed miserably in terms of perf/watt against Pascal. That is very clear from the reviews.. AMD lied that Polaris had 2x the perf/watt of Maxwell when the reality was it struggled to match Maxwell perf/watt. AMD has shown they wilfully misrepresent the facts and competitiveness of their upcoming products. I would advise people to not have any expectations based on AMD marketing claims. In fact I would say the opposite of their claims is likely to be true.
GTX 1060 and RX 480 both have exactly the same performance per watt: 35 GFLOPs/watt. That is how you count efficiency of GPUs. Let's wait and see what efficiency Vega will have. Don't you think that with HBM2/1 it can have higher efficiency?
Just so I can understand, what's the problem with a ZEN 8Core 16Threads 3.2GHz CPU at 95W TDP ??

Intel Core i7 6900K is 8Core 16Threads 3.2GHz base 3.7 Turbo (Single Core) at 140W TDP.

Are we 100% sure this 8Core 16Thread ZEN will only be 95W TDP ???
AMD Wraith Cooler is designed for 125W TDP.

IMO, AMD put up quite a nice smokescreen here with TDPs.
 

DrMrLordX

Lifer
Apr 27, 2000
21,807
11,161
136
Does Intel now have a first thread prioritization?

I wouldn't characterize it that way . . . more like, in "normal" code, they can usually squeeze ~70% of the core's potential throughput out of one thread. Tightly-coded stuff (like Linpack) can get 95-99% of the core's potential throughput out of one thread. Con cores simply can't do that, at all.

So what happens is that if you have 4c/8t Haswell/Broadwell/Skylake you can get maybe 70% potential throughput with 4 threads or 100% throughput with 8 threads thanks to SMT. With Con cores you can get 50-55% potential throughput with 4 threads and 100% throughput with 8 threads.

That's all I was really trying to say.
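
The percentages above can be written down as a small model. This is only a sketch under stated assumptions: threads are spread one per core (or module) before any core gets a second thread, and the second thread supplies whatever the first one leaves on the table. The function name and parameters are hypothetical, not from any tool.

```python
def potential_throughput(threads, cores, one_thread_fraction):
    """Fraction of the chip's potential throughput that is reached.

    threads             -- running threads, 1 .. 2 * cores
    cores               -- 2-thread cores (SMT) or 2-thread modules (CMT)
    one_thread_fraction -- share of a core/module one thread can extract,
                           e.g. ~0.70 for SMT Intel, ~0.50-0.55 for Con cores
    """
    if not 0 < threads <= 2 * cores:
        raise ValueError("threads must be between 1 and 2 * cores")
    first = min(threads, cores)   # cores/modules running at least one thread
    second = threads - first      # cores/modules running two threads
    total = first * one_thread_fraction + second * (1.0 - one_thread_fraction)
    return total / cores


if __name__ == "__main__":
    for name, frac in (("4c/8t SMT", 0.70), ("4m/8t Con", 0.55)):
        for t in (4, 8):
            share = potential_throughput(t, 4, frac)
            print(f"{name}, {t} threads: {share:.0%}")
```

With 4 threads the SMT chip sits at 70% and the Con chip at 55%; both reach 100% at 8 threads, which is the point being made: the difference is in how much a lightly threaded load can extract.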
 

zentan

Member
Jan 23, 2015
177
5
36
GTX 1060 and RX 480 both have exactly the same performance per watt. 35 GFLOPs/watt. That is how you count efficiency of GPUs.
Ok, so we should ignore all the current reviews and performance/efficiency estimations but look at some theoretical peak numbers from the two companies? Probably that's why AMD was so keen on showing off an 86W system power consumption of a Polaris against 140W of a 950-equipped system (and we know how that all turned out)?
The excuse is just ridiculous.
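
The disagreement here is between two different metrics, which a short sketch can make explicit. Peak FP32 GFLOPs is conventionally counted as 2 x shaders x clock (one FMA per shader per cycle); measured efficiency divides a benchmark result by measured power. All card figures below are made-up placeholders, not specs or review results for any real GPU.

```python
def peak_gflops_per_watt(shaders, clock_ghz, board_power_w):
    """Paper efficiency: theoretical FP32 GFLOPs divided by rated board power."""
    return 2.0 * shaders * clock_ghz / board_power_w


def measured_fps_per_watt(avg_fps, measured_power_w):
    """Review-style efficiency: benchmark frame rate per measured watt."""
    return avg_fps / measured_power_w


if __name__ == "__main__":
    # Hypothetical cards A and B, chosen only to show the two metrics diverging.
    paper_a = peak_gflops_per_watt(2000, 1.2, 140)
    paper_b = peak_gflops_per_watt(1200, 1.7, 120)
    real_a = measured_fps_per_watt(60, 160)
    real_b = measured_fps_per_watt(60, 115)
    print(f"A: {paper_a:.1f} GFLOPs/W on paper, {real_a:.3f} fps/W measured")
    print(f"B: {paper_b:.1f} GFLOPs/W on paper, {real_b:.3f} fps/W measured")
```

Two cards can look nearly identical on the paper metric and still differ clearly on the measured one, since neither achieved utilization nor actual power draw enters the GFLOPs/W figure.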
 

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
Polaris failed miserably in terms of perf/watt against Pascal. That is very clear from the reviews.. AMD lied that Polaris had 2x the perf/watt of Maxwell when the reality was it struggled to match Maxwell perf/watt. AMD has shown they wilfully misrepresent the facts and competitiveness of their upcoming products. I would advise people to not have any expectations based on AMD marketing claims. In fact I would say the opposite of their claims is likely to be true.

Polaris is quite fine in performance, especially DX12, and is behind in power, yes. I never said otherwise. I said it has better perf/W than Maxwell, which is true if you compare similar price brackets, and I said it is still far behind Pascal in perf/W, even if the situation is now much better than with the previous generation. Repeating your "opinions" again and again does not make them true. I already presented you facts; it is not my fault you cannot accept them and remain loyal to your own delusions. And it's not as if I couldn't show you TONS of misleading marketing material from NV and Intel, for example.
 
Last edited:
Reactions: sirmo

Glo.

Diamond Member
Apr 25, 2015
5,763
4,667
136
Ok, so we should ignore all the current reviews and performance/efficiency estimations but look at some theoretical peak numbers from the two companies? Probably that's why AMD was so keen on showing off an 86W system power consumption of a Polaris against 140W of a 950-equipped system (and we know how that all turned out)?
The excuse is just ridiculous.
In DX11 both GPUs are on par. In DX12 the RX 480 jumps ahead in performance, which equalizes the efficiency of the two GPUs (higher performance for higher power consumption on the RX 480).

Both GPUs have exactly the same efficiency in equal environment(low CPU overhead).
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I wouldn't characterize it that way . . . more like, in "normal" code, they can usually squeeze ~70% of the core's potential throughput out of one thread. Tightly-coded stuff (like Linpack) can get 95-99% of the core's potential throughput out of one thread. Con cores simply can't do that, at all.

So what happens is that if you have 4c/8t Haswell/Broadwell/Skylake you can get maybe 70% potential throughput with 4 threads or 100% throughput with 8 threads thanks to SMT. With Con cores you can get 50-55% potential throughput with 4 threads and 100% throughput with 8 threads.

That's all I was really trying to say.
OK, I understood. "Potential throughput" is the important metric here. Thanks for your clarification.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,866
3,418
136
I see the Derp Derp AMD squad has come in strong over the last day.
Still yet to back up any of their words with anything that matches any of the released or otherwise gleaned details. Where are the explicit or implicit weaknesses that are going to limit the core?
All tip and no iceberg........


Trolling, flamebait, and OT are not allowed
Markfw900
 
Last edited by a moderator:

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
I see the Derp Derp AMD squad has come in strong over the last day.
Still yet to back up any of their words with anything that matches any of the released or otherwise gleaned details. Where are the explicit or implicit weaknesses that are going to limit the core?
All tip and no iceberg........

There will be weaknesses, of course. As I said before, their ST performance, while better than XV, is probably not yet up to par with the Intel offerings and is likely to be below Broadwell or at best at that level. How much will also depend on clocks, and quite probably the release speeds will be lower than the competition's, and by that I mean Skylake, not Broadwell-E. This is also because they went straight for an 8 core part. Multithreading and FP: it remains to be seen where the performance will be, but it's likely to be competitive, especially compared to Intel's 4 core desktop parts. It all depends on pricing too; if they manage to sell the 8 core parts at or slightly below Intel's 4 core, they'll have a good price/performance ratio for many applications. Of course, they could also repeat the Bulldozer fiasco, but in that case the hints of very low ST performance were quite clear before the actual launch of the products.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
AMD has carefully chosen Blender benchmark to hide any weaknesses versus Broadwell, so only time will tell if there are any bottlenecks or glass jaws ( or even perf disasters due to bugs like original TLB bug in K10 etc ).

As Stilt's testing has shown, the Blender load has the following characteristics: limited vectorization, because it does not gain from AVX, and code that scales really well with SMT, meaning it can't extract all perf from a single thread due to the code's low potential IPC (no idea if it's due to dependencies or memory). Stuff like Linpack was already mentioned as a showcase of an ST-saturating floating point load; compared to that, Blender scales nicely.

So given that ZEN has 4 FP "ports", meaning 2 threads can issue up to 4 ops when stars align correctly ( and due to Blender not using wide vectors, stars do align somewhat here, 256bit vectors would choke load/store ).
Broadwell has just 2 FP "ports", meaning 2 threads can issue max up to 2 ops when stars align correctly ( and again, due to Blender being light loaded, half of each FP port hardware is sitting idle, so 64 byte/32 byte L/S is underutilized).

So the only thing that was revealed is literally this: if you are able to use SMT you can extract Broadwell-like FP throughput from ZEN, in Blender-like FP loads that do not vectorize well and do not use FMA.
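
The port argument can be sketched as a toy peak-issue model. The assumptions are loud ones: ZEN's FP ports are taken to be 128 bits wide and Broadwell's 256 bits wide, every port is treated as able to take any op (real ports are specialized), and an op wider than a port is split into port-width pieces. The function names are hypothetical.

```python
import math


def fp_ops_per_cycle(ports, port_width_bits, op_width_bits):
    """Peak vector FP ops issued per cycle across both SMT threads."""
    # An op wider than the port is assumed to occupy several issue slots.
    slots_per_op = math.ceil(op_width_bits / port_width_bits)
    return ports / slots_per_op


def fp_bits_per_cycle(ports, port_width_bits, op_width_bits):
    """Vector bits processed per cycle at that peak issue rate."""
    return fp_ops_per_cycle(ports, port_width_bits, op_width_bits) * op_width_bits


if __name__ == "__main__":
    for name, ports, width in (("ZEN (assumed 4 x 128-bit)", 4, 128),
                               ("Broadwell (2 x 256-bit)", 2, 256)):
        for op in (128, 256):
            ops = fp_ops_per_cycle(ports, width, op)
            bits = fp_bits_per_cycle(ports, width, op)
            print(f"{name}, {op}-bit ops: {ops:g} ops/cycle, {bits:g} bits/cycle")
```

In this model 128-bit code gives ZEN 4 ops per cycle against Broadwell's 2, with half of each Broadwell port idle, while 256-bit code brings both to 2 ops per cycle. That matches the shape of the argument above, but it says nothing about load/store limits or how often the "stars align".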
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Ok, so we should ignore all the current reviews and performance/efficiency estimations but look at some theoretical peak numbers from the two companies? Probably that's why AMD was so keen on showing off an 86W system power consumption of a Polaris against 140W of a 950-equipped system (and we know how that all turned out)?
The excuse is just ridiculous.

I don't think the Polaris 10 first demo was a 460 configuration. Most likely Apple and other OEMs are getting the best binned chips.

As for Zen, AMD comments about their Blender Broadwell comparison suggest they will try for at least a 3.2GHz base clock for highest SKU.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
So given that ZEN has 4 FP "ports", meaning 2 threads can issue up to 4 ops when stars align correctly ( and due to Blender not using wide vectors, stars do align somewhat here, 256bit vectors would choke load/store ).
Broadwell has just 2 FP "ports", meaning 2 threads can issue max up to 2 ops when stars align correctly ( and again, due to Blender being light loaded, half of each FP port hardware is sitting idle, so 64 byte/32 byte L/S is underutilized).
Most important ports are the first two;
HSW/BDW;
Port 0: FMA, Multiply(5-HSW/3-BDW), Divide, Packed Add, Packed Multiply(5)
Port 1: FMA, Add(3)/Multiply(5-HSW/3-BDW), Packed Add
SKL;
Port 0: FMA, Multiply/Add(4-SKL), Divide/Square Root, Packed Add, Packed Multiply(5)
Port 1: FMA, Multiply/Add(4-SKL), Packed Add, Packed Multiply(5)

Port 5: Vector Permutes and x87, Packed Add

3 128-bit Integer Adds, 1 128-bit Integer Multiply -> HSW/BDW
3 128-bit Integer Adds, 2 128-bit Integer Multiply -> SKL
2 128-bit Multiply, 1 128-bit Add -> HSW/BDW
2 128-bit Multiply, 2 128-bit Add -> SKL

or

3 256-bit Integer Adds, 1 256-bit Integer Multiply -> HSW/BDW
3 256-bit Integer Adds, 2 256-bit Integer Multiply -> SKL
2 256-bit Multiply, 1 256-bit Add -> HSW/BDW
2 256-bit Multiply, 2 256-bit Add -> SKL

Most likely, AMD's Blender didn't allow for 256-bit.
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Most likely, AMD's Blender didn't allow for 256-bit.

I always assumed that. Seems a sensible choice, as those who need 256-bit crunching should look elsewhere anyway. That was a given from the start.

The issue is whether the Blender test is representative of a more usual usage pattern for desktop users. What's your take here?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
I always assumed that. Seems a sensible choice, as those who need 256-bit crunching should look elsewhere anyway. That was a given from the start.

The issue is whether the Blender test is representative of a more usual usage pattern for desktop users. What's your take here?
I went looking. Blender supports AVX/AVX2 & FMA 256-bit instructions...

GCC;
set(CYCLES_AVX2_KERNEL_FLAGS "-ffast-math -msse -msse2 -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mlzcnt -mbmi -mbmi2 -mf16c -mfpmath=sse")
(-mfpmath=sse means xmm/ymm is preferred over x87; deprecated with 64-bit and -mavx.)
From the GCC documentation:
"GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.

These options enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options."
There is no -mprefer-avx128, so AVX256 will be selected.

Clang;
set(CYCLES_AVX2_KERNEL_FLAGS "-ffast-math -msse -msse2 -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mlzcnt -mbmi -mbmi2 -mf16c")

MSVC;
set(CYCLES_AVX2_ARCH_FLAGS "/arch:AVX /arch:AVX2")
set(CYCLES_AVX2_KERNEL_FLAGS "${CYCLES_AVX2_ARCH_FLAGS} /fp:fast -D_CRT_SECURE_NO_WARNINGS /GS-")

So, AMD must have done something. I do not think that particular test of Blender can be representative for desktop.
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
I went looking. Blender supports AVX/AVX2 & FMA 256-bit instructions...

GCC;
set(CYCLES_AVX2_KERNEL_FLAGS "-ffast-math -msse -msse2 -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mlzcnt -mbmi -mbmi2 -mf16c -mfpmath=sse")
(-mfpmath=sse means xmm/ymm is preferred over x87; deprecated with 64-bit and -mavx.)
From the GCC documentation:
"GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.

These options enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options."
There is no -mprefer-avx128, so AVX256 will be selected.

Clang;
set(CYCLES_AVX2_KERNEL_FLAGS "-ffast-math -msse -msse2 -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mlzcnt -mbmi -mbmi2 -mf16c")

MSVC;
set(CYCLES_AVX2_ARCH_FLAGS "/arch:AVX /arch:AVX2")
set(CYCLES_AVX2_KERNEL_FLAGS "${CYCLES_AVX2_ARCH_FLAGS} /fp:fast -D_CRT_SECURE_NO_WARNINGS /GS-")

So, AMD must have done something. I do not think that particular test of Blender can be representative for desktop.

It's way over my head. But what do you think they have done to avoid AVX256 then?
 

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
That they avoided 256-bit is speculation; only AMD knows. Zen seems able to execute 256-bit AVX code, only probably at one op every two cycles.
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
So, AMD must have done something. I do not think that particular test of Blender can be representative for desktop.

Despite a 10% CMT penalty, an Athlon 845 at 3.5GHz can match a 3.6GHz HW i3 in Blender, so what they did was design an SMT core that has roughly the same throughput as an EXV module.



As said ad nauseam, Zen has more FP execution resources than a whole EXV module, so the number is not even surprising; neither is the unwillingness of some people to acknowledge this fact, hence the theories about what AMD could have done or not done, lol...
 
Last edited: