New Zen microarchitecture details

Page 123 - AnandTech Forums

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
It isn't exactly rocket science to compare the performance characteristics of different CPUs. If your workloads of interest are mostly rendering, you'll want a CPU which performs well in floating-point benchmarks. On the other hand, if you mostly do A/V encoding, you'll want a CPU which does well in integer workloads. FP or INT performance generally isn't application specific. For example, AMD family 15h CPUs never perform well in floating point, no matter what application you are using.
 

cdimauro

Member
Sep 14, 2016
163
14
61
And how do you plan to measure such workloads? With synthetic benchmarks or real-world applications?

The crux of the dispute lies entirely here.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Ideally with both.
Personally I prefer to use as much open-source software (compilers included) as possible.
 

cdimauro

Member
Sep 14, 2016
163
14
61
I don't know about you, but when I run queries against a database, compile a project, run an emulator, compress a folder, etc., the last thing I want to know is the result of a synthetic benchmark, because it doesn't affect my real life. For professionals it's even worse: time = money.

And it has nothing to do with open-source software, compilers included. As I said, if I need the most from a CPU, I can even spend a lot of money on a good compiler that helps me.

A good compiler can even be open source. And nobody stops people, or the CPU vendor, from contributing to such projects to help their favorite microarchitecture.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
For which reasons? Are people interested in knowing abstract numbers about different hardware, or in how the applications they use daily perform?

So which CPU is used for the real application does not matter, right? The only thing that matters is performance? So since all CPUs are Turing complete, one could use x86, POWER, ARM or another architecture, provided that the software runs correctly?
And then one could measure performance and power consumption and decide which is best.
 

cdimauro

Member
Sep 14, 2016
163
14
61
Performance, power consumption, or both: it depends on the specific need.

You choose the applications that you need AND buy the CPU which satisfies your requirements.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
And as I said, it isn't rocket science to find the right hardware for your needs based on performance in the usual benchmarks.
If you can't do that yourself, then pay someone to do it, or wait for someone to test the performance of the specific workload on the specific hardware.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
It depends. If you use more precision than the canonical 8 bits for color components, the 24-bit precision offered by the single-precision mantissa might not be enough to handle the massive calculation chains of heavy algorithms like ray tracing.
I just checked. POVRay uses double precision ("DBL"), while Blender has "const float col[3]" and lots of floats anywhere else. Some doubles too, of course.
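A minimal sketch of the precision point above, assuming nothing from the POV-Ray or Blender sources themselves: with 16-bit color quanta, the 24-bit FP32 mantissa leaves little headroom once values accumulate.

```python
# A minimal sketch (not from the POV-Ray or Blender sources) of why FP32 gets
# marginal beyond the canonical 8-bit color components: the 24-bit mantissa
# leaves little headroom for long accumulation chains.
import struct

def float32_roundtrip(x: float) -> float:
    """Round a 64-bit Python float through IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

acc = 4096.0        # a partially accumulated value (exact in both formats)
step = 1 / 65535    # one 16-bit color quantum

# step is below half an ulp of acc in single precision (4096 * 2**-24),
# so the contribution survives in double precision but vanishes in single:
print(acc + step != acc)                      # True (double keeps it)
print(float32_roundtrip(acc + step) == acc)   # True (single drops it)
```

This is consistent with POV-Ray's choice of double ("DBL") for its core math.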
 

leoneazzurro

Golden Member
Jul 26, 2016
1,051
1,711
136
Let me ask a couple of questions about the supposed "256-bit" AVX deficiencies of Zen. AFAIK, Zen has 4x128-bit FP SIMD units, while the latest Intel arch has 2x256-bit. But Zen can execute 256-bit FP instructions too, by splitting the "larger" instructions across two clock cycles. So, theoretically, having double the units, this leads to the same theoretical peak FP256 rate as Broadwell, albeit probably at the cost of higher latencies. When executing FP128, Zen should have double the peak rate of Broadwell per clock (latencies apart). Also, I've read on several sites (sorry, too lazy to find the occurrences; some can be found directly on Intel's site: http://www.intel.com/content/dam/ww...on-e5-v3-advanced-vector-extensions-paper.pdf ) that when executing AVX2 code the maximum frequencies are reduced in order to keep down the power consumption.
So, it seems to me that if we look only at peak rate per clock, Zen has the upper hand in FP128 and parity in FP256.
While, of course, when looking for real performance in FP128/256, we should know:
- Real latencies of execution (influenced by memory/cache, schedulers, etc.)
- Actual clock speeds when executing these instructions (which may be lower than maximum turbo on Zen too)

Am I correct, or is there something else?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I just checked. POVRay uses double precision ("DBL"), while Blender has "const float col[3]" and lots of floats anywhere else. Some doubles too, of course.
Then how come POVRay performs better on AMD CPUs if they are 128-bit only? Execution port limitation?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Let me ask a couple of questions about the supposed "256-bit" AVX deficiencies of Zen. AFAIK, Zen has 4x128-bit FP SIMD units, while the latest Intel arch has 2x256-bit. But Zen can execute 256-bit FP instructions too, by splitting the "larger" instructions across two clock cycles. So, theoretically, having double the units, this leads to the same theoretical peak FP256 rate as Broadwell, albeit probably at the cost of higher latencies. When executing FP128, Zen should have double the peak rate of Broadwell per clock (latencies apart). Also, I've read on several sites (sorry, too lazy to find the occurrences; some can be found directly on Intel's site: http://www.intel.com/content/dam/ww...on-e5-v3-advanced-vector-extensions-paper.pdf ) that when executing AVX2 code the maximum frequencies are reduced in order to keep down the power consumption.
So, it seems to me that if we look only at peak rate per clock, Zen has the upper hand in FP128 and parity in FP256.
While, of course, when looking for real performance in FP128/256, we should know:
- Real latencies of execution (influenced by memory/cache, schedulers, etc.)
- Actual clock speeds when executing these instructions (which may be lower than maximum turbo on Zen too)

Am I correct, or is there something else?

Yes, AMD could have more sprint on legacy 128-bit code, both because of the 4 ports and the separate FP/INT schedulers. But Intel can do 2x256-bit FMAC. In AMD the four pipes are 2 FADD and 2 FMUL, so it can do 1 256-bit FADD and 1 256-bit FMUL at the same time, while Intel can do 2 256-bit FADDs, or 2 256-bit FMULs, or 2 256-bit FMACs. So Intel's peak rate, using only FMACs, is double AMD's.

Also, having more units and more ports, AMD can shine in legacy SMT code... Intel also has 4 L/S ports and can do 4 256-bit memory operations, versus AMD's 3x128-bit.

In short, Intel can shine in memory-intensive 256-bit code, or 256-bit code with many FMACs.
Everything else should favor AMD. This is one of the reasons for the Blender results. And I bet it will also shine in POV-Ray and Cinebench...
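The peak-rate accounting in this exchange can be sketched as quick arithmetic. The unit mix (2 FADD + 2 FMUL 128-bit pipes for Zen, 2x256-bit FMA ports for Broadwell/Skylake) is taken from the posts above; these are theoretical lane counts per clock, not measurements.

```python
# Theoretical single-precision FLOPs per clock implied by the posts' unit mix.
# Zen: 4x128-bit pipes (2 FADD + 2 FMUL); Intel: 2x256-bit FMA ports.
# Unit counts come from the discussion above, not from measurements.

SP_BITS = 32

def lanes(units: int, width_bits: int) -> int:
    return units * width_bits // SP_BITS

# Zen, ADD and MUL issued together: 2 FADD pipes + 2 FMUL pipes, 128-bit each.
zen_peak = lanes(2, 128) + lanes(2, 128)   # 8 + 8 = 16 FLOPs/clock

# Intel, 2 FMA ports, 256-bit each; one FMA counts as 2 FLOPs per lane.
intel_peak = lanes(2, 256) * 2             # 16 * 2 = 32 FLOPs/clock

print(zen_peak, intel_peak)   # 16 32: matching the 2x FMAC claim above
```

This only shows the FMA-heavy corner case; on ADD/MUL-mixed code the two peak rates come out equal.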
 
Reactions: kraatus77

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,550
136
I don't really agree with most of what you have written.

Yes, AMD could have more sprint on legacy 128-bit code, both because of the 4 ports and the separate FP/INT schedulers. But Intel can do 2x256-bit FMAC. In AMD the four pipes are 2 FADD and 2 FMUL, so it can do 1 256-bit FADD and 1 256-bit FMUL at the same time, while Intel can do 2 256-bit FADDs, or 2 256-bit FMULs, or 2 256-bit FMACs. So Intel's peak rate, using only FMACs, is double AMD's.
AMD issues 256-bit ops back to back, not "ganged together", so because the FPU is fully pipelined, a 256-bit op takes 1 cycle longer than a 128-bit op. What's more accurate to say is that executing 2x256-bit ops takes AMD 1 cycle longer than 128-bit ops. It then becomes a question of how much pressure there is on the FPU execution pipeline as to whether that extra FMAC execution width makes a difference; generally it doesn't, as SIMD floating point tends to be very load/store heavy.

In the case of FMA (just guessing):
1. 1st 128 bits gets ADD, then forwarded to MUL.
2. 1st 128 bits gets MUL/round, 2nd 128 bits gets ADD; 1st forwarded to FPRF.
3. 1st 128 bits gets returned to FPRF, 2nd 128 bits gets MUL/round; 2nd forwarded to FPRF.
4. 2nd 128 bits gets returned to FPRF.


Also, having more units and more ports, AMD can shine in legacy SMT code... Intel also has 4 L/S ports and can do 4 256-bit memory operations, versus AMD's 3x128-bit.

Your understanding of the Intel and AMD architectures is wrong. AMD doesn't have load or store ports like Intel does; every port can forward to the retirement buffer. AMD only has two AGUs vs Intel's 3 AGUs, but AMD has implemented a stack engine that is supposed to relieve load/store pressure on the AGUs, which should also reduce power costs.
In short, Intel can shine in memory-intensive 256-bit code, or 256-bit code with many FMACs.
Everything else should favor AMD. This is one of the reasons for the Blender results. And I bet it will also shine in POV-Ray and Cinebench...

What you have missed, and what really makes the difference, is the load/store and cache widths.
Zen: 256/128-bit load/store to L1D
Zen: 32B L1D to L2
Zen: 32B L2 to L3

Haswell and later: 512/256-bit load/store to L1D
Haswell and later: 64B L1D to L2
Haswell and later: 64B L2 to L3

This is what will actually make Intel much faster at 256-bit ops than AMD, FMAC or not. Go have a look at floating-point workloads: they have a very high percentage of loads and stores compared to execution. That is where Intel has the advantage, not in the 1-cycle advantage it gets from the extra width in the execution stage.
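The load/store arithmetic behind that point can be sketched directly. The widths (bits per cycle to L1D) are the ones quoted in the post, not independently verified.

```python
# Bytes per cycle available from L1D vs. bytes needed to feed a streaming
# 256-bit FMA (c = a*b + c reads two 32-byte source operands). Widths are
# the ones quoted in the post above, not independently verified.

def bytes_per_cycle(load_bits: int, store_bits: int) -> tuple:
    return load_bits // 8, store_bits // 8

zen_load, zen_store = bytes_per_cycle(256, 128)   # 32 B loads, 16 B stores
hsw_load, hsw_store = bytes_per_cycle(512, 256)   # 64 B loads, 32 B stores

fma_load_bytes = 2 * 32   # two 256-bit source operands per FMA

# Haswell can feed one memory-bound 256-bit FMA per cycle from L1D;
# Zen can feed only half of one, so it stalls on loads long before the
# 1-cycle execution-width penalty matters.
print(hsw_load / fma_load_bytes)   # 1.0
print(zen_load / fma_load_bytes)   # 0.5
```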
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Papermaster needs some "bite", imo. He's always kind of right, but boring. One could say Raja compensates, but they're surely different profiles. And that's good. But imo a CTO in this business needs to be more upfront and more JHH-like, even if it's a risk.

Edit: No, he doesn't need a super ego, dressing in a t-shirt and lying when it suits that ego. But less of it has value, because it drives things forward. That relentlessness.
 
Last edited:

bjt2

Senior member
Sep 11, 2016
784
180
86
I don't really agree with most of what you have written.


AMD issues 256-bit ops back to back, not "ganged together", so because the FPU is fully pipelined, a 256-bit op takes 1 cycle longer than a 128-bit op. What's more accurate to say is that executing 2x256-bit ops takes AMD 1 cycle longer than 128-bit ops. It then becomes a question of how much pressure there is on the FPU execution pipeline as to whether that extra FMAC execution width makes a difference; generally it doesn't, as SIMD floating point tends to be very load/store heavy.

In the case of FMA (just guessing):
1. 1st 128 bits gets ADD, then forwarded to MUL.
2. 1st 128 bits gets MUL/round, 2nd 128 bits gets ADD; 1st forwarded to FPRF.
3. 1st 128 bits gets returned to FPRF, 2nd 128 bits gets MUL/round; 2nd forwarded to FPRF.
4. 2nd 128 bits gets returned to FPRF.

Ganging or back to back, either way Zen's FPU throughput on 256-bit FMAC is HALF of Skylake's. If they were 4 128-bit FMAC pipelines, they would be on par.

Your understanding of the Intel and AMD architectures is wrong. AMD doesn't have load or store ports like Intel does; every port can forward to the retirement buffer. AMD only has two AGUs vs Intel's 3 AGUs, but AMD has implemented a stack engine that is supposed to relieve load/store pressure on the AGUs, which should also reduce power costs.

I know about the AGU count and the fact that they are decoupled from the load and store units (the load and store queues attest to this), and also about the stack memfile (a sort of L0 cache), but the latter handles at most 64-bit accesses; 128- and 256-bit accesses must go through the load and store queues. Since AMD can do 2x128-bit reads and 1x128-bit write per clock and there are only 2 AGUs, I suppose that for simple addressing (e.g. MOV AX,[BX]) they don't use an AGU; otherwise you couldn't do 2 reads and 1 write per clock. I also know about the 4 memory pipelines with attached AGUs on Intel CPUs (I am not sure, but I remember 1 L, 2 L/S and 1 S), which are 256-bit.


What you have missed, and what really makes the difference, is the load/store and cache widths.
Zen: 256/128-bit load/store to L1D
Zen: 32B L1D to L2
Zen: 32B L2 to L3

Haswell and later: 512/256-bit load/store to L1D
Haswell and later: 64B L1D to L2
Haswell and later: 64B L2 to L3

This is what will actually make Intel much faster at 256-bit ops than AMD, FMAC or not. Go have a look at floating-point workloads: they have a very high percentage of loads and stores compared to execution. That is where Intel has the advantage, not in the 1-cycle advantage it gets from the extra width in the execution stage.

AMD's L1<->L2, L2<->L3 and L3<->NB buses are 32B read PLUS 32B write (see the AMD slides), for a total of 64 bytes. 256+128-bit for L1 is correct.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,550
136
Ganging or back to back, either way Zen's FPU throughput on 256-bit FMAC is HALF of Skylake's. If they were 4 128-bit FMAC pipelines, they would be on par.
Yes, and that width only costs a single extra cycle. That's why I said that for the width to make a difference you need lots of pressure on the execution pipes; if there are any gaps, the 128-bit pipes will catch back up.


I know about the AGU count and the fact that they are decoupled from the load and store units (the load and store queues attest to this)
Then why say something completely contradictory?

and also about the stack memfile (a sort of L0 cache), but the latter handles at most 64-bit accesses; 128- and 256-bit accesses must go through the load and store queues. Since AMD can do 2x128-bit reads and 1x128-bit write per clock and there are only 2 AGUs, I suppose that for simple addressing (e.g. MOV AX,[BX]) they don't use an AGU; otherwise you couldn't do 2 reads and 1 write per clock. I also know about the 4 memory pipelines with attached AGUs on Intel CPUs (I am not sure, but I remember 1 L, 2 L/S and 1 S), which are 256-bit.
So remember, you're still going to have a stack to process regardless of what is being executed; also consider that your compiler will prioritize loads ahead of stores. So again, you would need lots of execution with no bubbles for this to be an issue, but we know that doesn't happen very often for sustained periods; otherwise we wouldn't see SMT uplift in heavy FP code, but we do.


AMD's L1<->L2, L2<->L3 and L3<->NB buses are 32B read PLUS 32B write (see the AMD slides), for a total of 64 bytes. 256+128-bit for L1 is correct.
I listened to the Hot Chips presentation (20:50) and it is explicitly stated that there is 32B total from L3 to L2 to L1; here is the diagram: http://images.anandtech.com/doci/10591/HC28.AMD.Mike Clark.final-page-013.jpg?_ga=1.232415519.1001491487.1458767430

Then look at http://www.realworldtech.com/haswell-cpu/5/ and you will see the caches are described the same way. Why would you have more L2/L3 bandwidth than L1-to-core bandwidth? Caches are bandwidth amplifiers: you need less bandwidth the further away you move, and the power costs go up the further away you move.
 
Last edited:

bjt2

Senior member
Sep 11, 2016
784
180
86
I listened to the Hot Chips presentation (20:50) and it is explicitly stated that there is 32B total from L3 to L2 to L1; here is the diagram: http://images.anandtech.com/doci/10591/HC28.AMD.Mike Clark.final-page-013.jpg?_ga=1.232415519.1001491487.1458767430

Then look at http://www.realworldtech.com/haswell-cpu/5/ and you will see the caches are described the same way. Why would you have more L2/L3 bandwidth than L1-to-core bandwidth? Caches are bandwidth amplifiers: you need less bandwidth the further away you move, and the power costs go up the further away you move.

They used bad symbols in their slides... If the bus is 32 bytes bidirectional (and so half duplex) they should have used the symbol <--------->, like in this image for Bulldozer:


I found this image after much searching, and it confirms that you are right: AMD stated that they doubled the bandwidth. According to this image, the Bulldozer bus was 128-bit (16 bytes) bidirectional and half duplex. Double that is 32 bytes bidirectional. But using 2 unidirectional buses as the symbol creates confusion...

EDIT: this is regarding the bus to the NB... As you can see, the L1<->L2 buses are 2x128-bit unidirectional buses. Again, AMD states a doubling of the bandwidth here, so the L1<->L2 buses should be 2x32 bytes.

P.S.: this is an academic paper, so I think it is correct...

EDIT2: regarding bandwidth amplification: the caches must also serve prefetch requests, so bandwidth equal to or greater than the upper layers is not useless...
 

KTE

Senior member
May 26, 2016
478
130
76
"We had a power-optimized set of processors for the low end. We had a very high-performance set of processors for the mid- and high-end ranges. In Zen, we wanted a new and modern core in every respect, meaning it can handle a range of workloads."


With all due respect, in 2012, there was no such thing from AMD.

Sent from HTC 10
(Opinions are own)
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
"We had a power-optimized set of processors for the low end. We had a very high-performance set of processors for the mid- and high-end ranges. In Zen, we wanted a new and modern core in every respect, meaning it can handle a range of workloads."


With all due respect, in 2012, there was no such thing from AMD.

Sent from HTC 10
(Opinions are own)
Did he speak about server throughput or CB ST performance? And did he compare it to a competitor or just to some older internal product? Did he present it as product classes, or did he try to appeal to the mindsets of perfectionist enthusiasts on technical forums?
 
Reactions: bjt2

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
Did he speak about server throughput or CB ST performance? And did he compare it to a competitor or just to some older internal product? Did he present it as product classes, or did he try to appeal to the mindsets of perfectionist enthusiasts on technical forums?

Apparently the key word is versatility, so that encompasses all of the above:

http://www.pcworld.com/article/3109...competition-back-to-high-performance-x86.html

What you’re seeing with Zen is its versatility. What is Cinebench trying to represent, or the benchmark that we showed today? They show off—we show off a number of multithreaded applications. And you saw that we’ve done a true simultaneous multithreaded implementation. That really helps double the effective cores for those applications.

But we did it—and I mentioned this in the presentation—by increasing the resources in that execution pipeline. So if you are running single-thread, you get the benefit of these additional resources. It’s a versatile core; it’s going to play well to single-threaded and multithreaded applications.
 

cdimauro

Member
Sep 14, 2016
163
14
61
I just checked. POVRay uses double precision ("DBL"), while Blender has "const float col[3]" and lots of floats anywhere else. Some doubles too, of course.
Blender is fine with 8-bit color components; however, a float isn't enough for 16-bit components. An FP32 mantissa has only 24 bits of precision (23 stored: the leading bit is implicit and always set to 1), leaving only a few (7) extra bits for dealing with the calculations.
 

KTE

Senior member
May 26, 2016
478
130
76
Did he speak about server throughput or CB ST performance? And did he compare it to a competitor or just to some older internal product? Did he present it as product classes, or did he try to appeal to the mindsets of perfectionist enthusiasts on technical forums?
You know my hobby since 17 has been trading performance vehicles...

Wishy-washy ambiguous language is the mark of a snake-oil salesman.

I am pretty sure no one in the IT community thought of BD as very high performance. Tech forums or corporate boffins alike.

Zen, however, has a major chance to capture huge market share in the biggest corporate drivers today: big data analytics, data warehousing and cloud. It needs power, and all the top datacentres are after consolidation at the lowest power possible. Good ST is always beneficial, but ST isn't as important here; scaling and throughput are paramount, though. VMs are licensed per core + 2GB mem in these segments. It's far easier to consolidate with high core counts and just manage VMs, as well as leave extra for standby power (higher charge).

Whereas with lower core counts, this is not possible.

Sent from HTC 10
(Opinions are own)
 
Reactions: Dresdenboy

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
GF announces 7nm
Press release
http://www.digitimes.com/news/a20160919PR202.html
Ramping H1 2018
Key paths ready for EUV

"Globalfoundries' new 7nm FinFET technology is expected to deliver more than twice the logic density and a 30% performance boost compared to today's 16/14nm foundry FinFET offerings, the company claimed"

Edit: ramp to risk production early 2018
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,550
136
Zen, however, has a major chance to capture huge market share in the biggest corporate drivers today: big data analytics, data warehousing and cloud. It needs power, and all the top datacentres are after consolidation at the lowest power possible. Good ST is always beneficial, but ST isn't as important here; scaling and throughput are paramount, though. VMs are licensed per core + 2GB mem in these segments. It's far easier to consolidate with high core counts and just manage VMs, as well as leave extra for standby power (higher charge).

Whereas with lower core counts, this is not possible.

Sent from HTC 10
(Opinions are own)

Just to add to this (my job is designing data centre infrastructure): Zen has some really big opportunities because of 16 memory channels across 2P. This is because 32/64 GB LR-DIMMs are around 20/60 percent more expensive per GB than 16GB RDIMMs. Your typical enterprise VM farm is memory-capacity, memory-throughput and IO constrained long before it is CPU-throughput constrained.

Most VM farms use middle-of-the-road E5 Xeons (depending on the generation, 10 to 16 cores etc.), and even then you can normally get pretty aggressive in terms of CPU oversubscription.

So when you look at a server as a platform for VM farms and its total cost, Zen should be able to come in significantly cheaper when you target something like 512GB (or 1024GB) of memory and 18-24 cores, while having on average a higher VM density because you have more memory bandwidth.

My big hope is that the next-gen SoC after Zen (rumored to be 48 cores) keeps the same memory-channel-to-core ratio (so 12 controllers per P), going from a 4x8 to a 6x8 chip configuration, but I think it is most likely that the CCX grows to 6 cores. The reason I hope AMD keeps pushing memory channels is that Intel isn't going to stay at 6 per P forever.
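The DIMM-cost argument above can be illustrated with simple arithmetic. The ~20%/60% LR-DIMM premiums come from the post; the $8/GB RDIMM baseline and the slot counts are hypothetical placeholders, not real prices or SKUs.

```python
# Illustrative arithmetic for the DIMM-cost argument above. The ~20%/60%
# LR-DIMM premiums are the post's figures; the $8/GB RDIMM baseline is a
# hypothetical placeholder, not a real price.

PER_GB = {
    16: 8.0,        # 16 GB RDIMM, assumed baseline $/GB
    32: 8.0 * 1.2,  # 32 GB LR-DIMM, ~20% premium (post's figure)
    64: 8.0 * 1.6,  # 64 GB LR-DIMM, ~60% premium (post's figure)
}

def cheapest_fill(target_gb: int, slots: int):
    """Cheapest DIMM size that reaches target_gb within `slots` slots."""
    options = [(size, target_gb * per_gb) for size, per_gb in PER_GB.items()
               if size * slots >= target_gb]
    return min(options, key=lambda o: o[1]) if options else None

# 512 GB on a hypothetical 16-slot 2P box: forced onto 32 GB LR-DIMMs.
# 512 GB on a 32-slot 2P box (16 channels, 2 DIMMs/channel): RDIMMs fit.
print(cheapest_fill(512, 16))   # (32, 4915.2)
print(cheapest_fill(512, 32))   # (16, 4096.0)
```

More channels means the same capacity can be reached with smaller, cheaper-per-GB DIMMs, which is the crux of the 16-channel argument.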
 