New Zen microarchitecture details


NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,228
136
So Zen FO4 is lower?

EDIT: for clarity, I supposed the same FO4 because I assumed a worst case of 20 stages for BD...
The FO4 count is usually whatever is needed to target a certain clock rate or power consumption target on a specific node. PDSOI/FDSOI has lower inverter delay than Bulk/FinFETs so the FO4 count can be completely different for the same architecture.

Bulldozer is 15 stages for integer. The branch predictor is decoupled, which means its mispredict penalty doesn't match the architecture's pipeline stages. It handles both cores and it doesn't cause bubbles in fetch, decode, execution, or load/store. The bubble only impacts the branch predict stage.

Decoupled branch predictors that do not match the length of the pipeline (all of the below tend to be longer than the pipeline):
Bulldozer(all 15h)
Skylake
Zen
 

Abwx

Lifer
Apr 2, 2011
11,514
4,301
136
The FO4 count is usually whatever is needed to target a certain clock rate or power consumption target on a specific node.

It's not whatever is needed, it's a characteristic of a process. If the FO4 delay of Piledriver is short, it is due to the SOI process characteristics and nothing else.

FTR the FO4 delay depends on the transistors' switching capacitance and their transconductance. To simplify, more capacitance increases the delay while more transconductance reduces it: the max frequency is proportional to gm/C, where gm is the transconductance and C is the capacitance, and reciprocally the delay is proportional to C/gm.
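
The gm/C relation can be played with in a few lines of Python. This is only a sketch of the proportionality described in the post, in arbitrary units; the function names and the 20% figures are made up for illustration and are not measurements of any real process.

```python
def relative_delay(capacitance: float, transconductance: float) -> float:
    """Gate delay in arbitrary units, taken as proportional to C / gm."""
    return capacitance / transconductance


def relative_max_frequency(capacitance: float, transconductance: float) -> float:
    """Maximum frequency in arbitrary units, taken as proportional to gm / C."""
    return transconductance / capacitance


# Illustrative, made-up ratios relative to a baseline process (C = 1, gm = 1).
baseline = relative_delay(1.0, 1.0)
more_capacitance = relative_delay(1.2, 1.0)       # 20% more capacitance
more_transconductance = relative_delay(1.0, 1.2)  # 20% more transconductance

print(f"baseline delay:             {baseline:.3f}")
print(f"with 20% more capacitance:  {more_capacitance:.3f}")
print(f"with 20% more gm:           {more_transconductance:.3f}")
```

More capacitance pushes the delay above the baseline and more transconductance pulls it below, which is all the post is claiming.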
 

CentroX

Senior member
Apr 3, 2016
351
152
116
Without Zen, Intel wouldn't rush their Kaby Lake CPUs. I bet that AMD has saved us at least 6 months by trying to compete with Intel. Imagine if Zen wasn't coming out. We would see lazy-ass Intel for another few years.
 
Mar 10, 2006
11,715
2,012
126
Without Zen, Intel wouldn't rush their Kaby Lake CPUs. I bet that AMD has saved us at least 6 months by trying to compete with Intel. Imagine if Zen wasn't coming out. We would see lazy-ass Intel for another few years.

Intel pushed out Kaby Lake from Q4 2016 to CES. Also, PC OEMs like ~1yr product cycles so they can refresh their systems; Intel's schedule with Kaby Lake has essentially nothing to do with Zen.
 

gammaray

Senior member
Jul 30, 2006
859
17
81
Without Zen, Intel wouldn't rush their Kaby Lake CPUs. I bet that AMD has saved us at least 6 months by trying to compete with Intel. Imagine if Zen wasn't coming out. We would see lazy-ass Intel for another few years.

roll eyes
 

bjt2

Senior member
Sep 11, 2016
784
180
86
At this point they of course know at which frequency it will be clocked, and it will likely be in the same ballpark as their current designs. For those interested in real info rather than FUD theories, mainly brought by a certain public out of fear that AMD could rival Intel in their strongholds: AMD explicitly stated that 14nm allows higher frequencies at much lower power, in relative as well as absolute terms. In one word, 14nm clocks higher than 28nm at lower power.



This was released with the other Zen slides when they did their Blender demo, so the graph relates to their CPUs' frequencies and efficiencies. We don't even need graduated vertical and horizontal scales, the curves are explicit enough.

I tried to demonstrate, in various ways (Apple A9, FO4, NEON FPU), that Zen will not be clocked as low as 3GHz... But since BDW-E is 3.2GHz, many think it IMPOSSIBLE that tiny AMD can do better, forgetting that AMD, on 28nm BULK, with an 8 core, ALREADY went even further...
 

KTE

Senior member
May 26, 2016
478
130
76
Abwx: You're using an ambiguous marketing slide without figures as evidence for frequency gain at the same power.

That slide could be in the 2GHz range... Or the 1GHz.

Can you show?

Are you saying that FO4 is a meaningless/unuseful metric?
I am talking of approximate frequency. I don't want to know with 100MHz precision.
These considerations tell us that, at the same transistor count, n Zen cores should clock at least as high as m BD cores at the same power.
Moreover, the official AMD statement is that (core to core, I presume) the energy consumed per cycle is the same. So an 8c Zen at 95W should run at the same clock as a hypothetical 8c XV at 95W... And so on...
Even if you think FO4 is the same or similar (it isn't, it's worse for clocks, as BD was a speed demon design. Speed demon = low IPC x high clocks):

XV 3.5GHz 8C hits 130W - best possible case - based on the top DT model (in reality it is more).

At lower width, caches and work done per clock.

Hamstring that to 35W less, we're at 3.1-3.2GHz. Again, best case.

At the same process, new far beefier arch/width/exe/caches puts 30-40W higher usage. More work done requires more power. We're at 130W ish again.

Now you work the process magic. 30% efficiency gain, and you're at ~91W, or 95W rounded up.

That's what you'd be looking at, best possible case for frequency. Reality is worse.
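
For anyone who wants to poke at the chain of adjustments above, here is a rough Python sketch. The inputs are the figures from the post (130W, a 35W cut, 30-40W extra for the wider core, 30% process gain); the function name and the choice to apply the steps in this order are assumptions made for illustration.

```python
def kte_power_estimate(xv_8c_watts: float = 130.0,
                       power_cut_watts: float = 35.0,
                       zen_extra_watts: float = 35.0,
                       process_gain: float = 0.30) -> float:
    """Follow the post's adjustments step by step and return watts."""
    capped = xv_8c_watts - power_cut_watts       # 8C XV held to ~3.1-3.2GHz
    same_process = capped + zen_extra_watts      # beefier core on the same node
    return same_process * (1.0 - process_gain)   # apply the process efficiency gain


low = kte_power_estimate(zen_extra_watts=30.0)
mid = kte_power_estimate(zen_extra_watts=35.0)
high = kte_power_estimate(zen_extra_watts=40.0)

print(f"estimate range: {low:.1f}W to {high:.1f}W (midpoint {mid:.1f}W)")
```

With the midpoint of the 30-40W range the result lands on the ~91W quoted in the post.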

Sent from HTC 10
(Opinions are own)
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Abwx: You're using an ambiguous marketing slide without figures as evidence for frequency gain at the same power.

That slide could be in the 2GHz range... Or the 1GHz.

Can you show?


Even if you think FO4 is the same or similar (it isn't, it's worse for clocks, as BD was a speed demon design. Speed demon = low IPC x high clocks):

XV 3.5GHz 8C hits 130W - best possible case - based on the top DT model (in reality it is more).

At lower width, caches and work done per clock.

Hamstring that to 35W less, we're at 3.1-3.2GHz. Again, best case.

At the same process, new far beefier arch/width/exe/caches puts 30-40W higher usage. More work done requires more power. We're at 130W ish again.

Now you work the process magic. 30% efficiency gain, and you're at ~91W, or 95W rounded up.

That's what you'd be looking at, best possible case for frequency. Reality is worse.

Sent from HTC 10
(Opinions are own)

BD has a 15-stage pipeline, Zen 19. So Zen's FO4 is probably lower than BD's. But anyway, the A12 9800 has 3.8 base and 4.2 turbo in 65W with 4 cores plus GPU. So 8 Zen cores plus 1024 SP should draw 130W. Discard the GPU, consider the 28nm->14nm thing, and you'll see that you have exaggerated your power estimation...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
There is no XV 8C at 28nm

We are calculating the TDP of a hypothetical 8c XV without GPU...

I have more data.

The A12 9800 has 512 SP at over 1GHz.
The AMD R9 380X GPU has 2048 SP at 1GHz with 190W TDP. So 512 SP at 1GHz should draw roughly 45W. That leaves roughly 20W of the 65W for 4 XV cores at 3.8-4.2GHz... Then an 8c XV without GPU at 3.8-4.2GHz is highly feasible in 65W, let alone 95W. AMD declared the same energy/cycle for a Zen core at 14nm versus an XV core at 28nm, so a 4GHz 8c Zen @95W is feasible...
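
The scaling above is easy to reproduce. This is a sketch that assumes, as the post does, that GPU power scales linearly with shader count at the same clock and that the whole 65W TDP is split between GPU and CPU cores; the function and variable names are made up for illustration. Straight linear scaling gives slightly different numbers than the rounded ones in the post.

```python
def linear_gpu_power(ref_watts: float, ref_shaders: int, shaders: int) -> float:
    """Scale a reference GPU's power linearly with shader count (an assumption)."""
    return ref_watts * shaders / ref_shaders


APU_TDP_WATTS = 65.0  # A12 9800 TDP quoted in the thread

gpu_watts = linear_gpu_power(190.0, 2048, 512)  # R9 380X scaled down to 512 SP
cpu_watts_4c = APU_TDP_WATTS - gpu_watts        # what is left for the 4 XV cores
cpu_watts_8c = 2 * cpu_watts_4c                 # hypothetical 8c XV without GPU

print(f"GPU share:   {gpu_watts:.1f}W")
print(f"4 XV cores:  {cpu_watts_4c:.1f}W")
print(f"8 XV cores:  {cpu_watts_8c:.1f}W")
```

Either way the hypothetical 8c XV without GPU stays well under 65W, which is the point being argued.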
 

KTE

Senior member
May 26, 2016
478
130
76
BD has a 15-stage pipeline, Zen 19. So Zen's FO4 is probably lower than BD's. But anyway, the A12 9800 has 3.8 base and 4.2 turbo in 65W with 4 cores plus GPU. So 8 Zen cores plus 1024 SP should draw 130W. Discard the GPU, consider the 28nm->14nm thing, and you'll see that you have exaggerated your power estimation...
Pipeline depth is uarch (and it differs per instruction) and FO4 is process.

Very separate matters.

As already shown, you are in other words using BD at 28nm, extrapolating it to 14nm, then positively correlating it to mean Zen frequency and power.

A major process and uarch change just does not scale like that.

Just to be clear, what frequencies are you expecting for Zen?

Sent from HTC 10
(Opinions are own)
 

coercitiv

Diamond Member
Jan 24, 2014
6,593
13,908
136
BD has a 15-stage pipeline, Zen 19. So Zen's FO4 is probably lower than BD's. But anyway, the A12 9800 has 3.8 base and 4.2 turbo in 65W with 4 cores plus GPU. So 8 Zen cores plus 1024 SP should draw 130W. Discard the GPU, consider the 28nm->14nm thing, and you'll see that you have exaggerated your power estimation...
The A12 9800 has 2 XV modules. The equivalent of Zen 8C would be 8 XV modules.

Doing some ballpark napkin math, if 2 XV modules use around 50W to work at frequencies higher than 4GHz and around 40W to work in a more efficient zone at just below 4GHz, then this hypothetical 28nm equivalent would use roughly 160-200W. If you take this rough estimate and shrink it by 30% to account for the 14nm transition, the XV equivalent of Zen would use 110-140W.

This is where 2 more factors come in:
1. Zen has more processing resources, hence uses more power per clock. I'll take KTE's numbers and add 30-40W more, although since it's an absolute value it's not clear to me whether this ballpark value was considered relative to 28nm or 14nm.
2. Zen offers significantly more performance per clock, hence can be clocked significantly lower and still come out on top performance-wise. This point is critical, because lowering frequency by 10-20% in a high performance product can easily drop power usage by as much as 20-40%.

So, time for even more dirty napkin math:
Best case scenario: (110W+30W)x0.6 = 84W
Worst case: (140W+40W)x0.8 = 144W

Average is 114W for a product with operating frequency somewhere between 3.2GHz and 3.6GHz.

Napkin math checks out, there's an operating frequency somewhere above 3GHz where our fictional product will hit 95W TDP.
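
The napkin math can be written out as a short Python sketch. All inputs are the ballpark figures from this post; the function name and the variable names are made up for illustration, and the 0.6 / 0.8 factors are the post's guess at how much power a lower clock saves.

```python
def final_power(xv_equivalent_watts: float,
                zen_extra_watts: float,
                clock_power_factor: float) -> float:
    """(XV-equivalent power + extra for Zen's resources) x lower-clock factor."""
    return (xv_equivalent_watts + zen_extra_watts) * clock_power_factor


# 2 XV modules at 40-50W, scaled to 8 modules, then shrunk by 30% for 14nm.
scaled_28nm = [watts * 4 for watts in (40.0, 50.0)]    # 160W and 200W
shrunk_14nm = [watts * 0.7 for watts in scaled_28nm]   # about 112W and 140W

# The post rounds the shrunk range to 110-140W before the last step.
best = final_power(110.0, 30.0, 0.6)
worst = final_power(140.0, 40.0, 0.8)
average = (best + worst) / 2

print(f"28nm equivalent: {scaled_28nm[0]:.0f}-{scaled_28nm[1]:.0f}W")
print(f"after 30% shrink: {shrunk_14nm[0]:.0f}-{shrunk_14nm[1]:.0f}W")
print(f"best {best:.0f}W, worst {worst:.0f}W, average {average:.0f}W")
```

Changing the two clock factors is the quickest way to see how sensitive the 95W conclusion is to the assumed frequency drop.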

However, as The Stilt and others mentioned several times before in this thread, it all comes down to the 14nm process used by AMD being able to clock significantly higher than 3GHz and still maintain its advantage over the best 28nm has to offer. The billion dollar question is where this advantage starts to fall off, and this is where Zen may or may not run out of gas.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Pipeline depth is uarch (and it differs per instruction) and FO4 is process.

Very separate matters.

As already shown, you are in other words using BD at 28nm, extrapolating it to 14nm, then positively correlating it to mean Zen frequency and power.

A major process and uarch change just does not scale like that.

Just to be clear, what frequencies are you expecting for Zen?

Sent from HTC 10
(Opinions are own)

Are you saying that a 19-stage pipeline CPU on the SAME ISA has a higher FO4 than a 15-stage pipeline architecture?

FO4 delay in ns is the process.

I am talking about relative FO4 delay.

For instance, BD is estimated at 17 FO4 per stage, namely 17 times one FO4 delay.
And BD on 28nm BULK tops out at 4.2-4.3GHz, with 3.8-4.1 base clock (depending on the TDP).

Relative FO4 is architecture related.

Are you saying that a 19-stage x86 architecture has a higher FO4 delay than a 15-stage x86 architecture from the same manufacturer?

Anyway, the absolute FO4 delay of a 14nm FinFET process is probably lower than that of 28nm BULK...

I am thinking of 4GHz@95W, at least on a slightly mature process. Maybe not the first batches...
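
To make the relative vs absolute FO4 distinction concrete, here is a small Python sketch. It assumes the cycle time is simply (FO4 per stage) x (one FO4 delay), ignoring everything else; the 17 FO4 and 4.2GHz figures are the BD estimates quoted above, while the 15 FO4 per stage case and the function names are purely hypothetical.

```python
def implied_fo4_delay_ps(freq_ghz: float, fo4_per_stage: float) -> float:
    """Absolute FO4 delay (ps) implied by a clock and a per-stage FO4 count."""
    cycle_ps = 1000.0 / freq_ghz       # one cycle in picoseconds
    return cycle_ps / fo4_per_stage


def implied_clock_ghz(delay_ps: float, fo4_per_stage: float) -> float:
    """Clock (GHz) reachable with a given FO4 delay and per-stage FO4 count."""
    return 1000.0 / (delay_ps * fo4_per_stage)


# BD estimate from the post: 17 FO4 per stage, topping out around 4.2GHz.
bd_delay_ps = implied_fo4_delay_ps(4.2, 17)

# Hypothetical design with fewer FO4 per stage on the same process.
shorter_stage_ghz = implied_clock_ghz(bd_delay_ps, 15)

print(f"implied FO4 delay: {bd_delay_ps:.1f} ps")
print(f"same process, 15 FO4 per stage: {shorter_stage_ghz:.2f} GHz")
```

The first number is the process side (absolute delay), the second shows how the architecture side (FO4 per stage) moves the clock on the same process.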
 

bjt2

Senior member
Sep 11, 2016
784
180
86
The A12 9800 has 2 XV modules. The equivalent of Zen 8C would be 8 XV modules.

Doing some ballpark napkin math, if 2 XV modules use around 50W to work at frequencies higher than 4GHz and around 40W to work in a more efficient zone at just below 4GHz, then the 28nm equivalent of 8C Zen would use roughly 160-200W. If you take this rough estimate and shrink it by 30% to account for the 14nm transition, the XV equivalent of Zen would use 110-140W.

This is where 2 more factors come in:
1. Zen has more processing resources, hence uses more power per clock. I'll take KTE's numbers and add 30-40W more, although since it's an absolute value it's not clear to me whether this ballpark value was considered relative to 28nm or 14nm.
2. Zen offers significantly more performance per clock, hence can be clocked significantly lower and still come out on top performance-wise. This point is critical, because lowering frequency by 10-20% in a high performance product can easily drop power usage by as much as 20-40%.

So, time for even more dirty napkin math:
Best case scenario: (110W+30W)x0.6 = 84W
Worst case: (140W+40W)x0.8 = 144W

Average is 114W for a product with operating frequency somewhere between 3.2GHz and 3.6GHz.

Napkin math checks out, there's an operating frequency somewhere above 3GHz where our fictional product will hit 95W TDP.

However, as The Stilt and others mentioned several times before in this thread, it all comes down to the 14nm process used by AMD being able to clock significantly higher than 3GHz and still maintain its advantage over the best 28nm has to offer. The billion dollar question is where this advantage starts to fall off, and this is where Zen may or may not run out of gas.

AMD stated that a Zen core uses the same energy per clock as an XV core, at the same clock. Probably core to core. Let me justify this bold statement:

Zen should have the same or lower relative FO4 delay, because it has more stages than BD, and given that the 14nm FO4 delay should be lower, the Vcore needed for the same frequency should be lower.

Probably 1.1V vs 1.35V. That is roughly -20% Vcore and -40% power for the same number of transistors. If the current is also lower, due to lower capacitance (and this could be the case), then we go well beyond -50%... And indeed GF claims up to -65% power.
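
The voltage argument follows the usual dynamic power relation, P proportional to C x V^2 x f. A quick Python sketch, using the 1.1V and 1.35V guesses from the post; the 0.75 capacitance ratio is a made-up illustrative value, not a GF or AMD figure, and leakage is ignored.

```python
def relative_dynamic_power(v_new: float, v_old: float,
                           cap_ratio: float = 1.0,
                           freq_ratio: float = 1.0) -> float:
    """New/old dynamic power ratio, using P proportional to C * V^2 * f."""
    return cap_ratio * (v_new / v_old) ** 2 * freq_ratio


# Voltage drop alone, same capacitance and same clock.
voltage_only = relative_dynamic_power(1.1, 1.35)

# Hypothetical: switched capacitance also drops to 75% of the 28nm value.
with_lower_cap = relative_dynamic_power(1.1, 1.35, cap_ratio=0.75)

print(f"voltage only:      {voltage_only:.2f}x power ({(1 - voltage_only) * 100:.0f}% less)")
print(f"plus lower C (hyp): {with_lower_cap:.2f}x power ({(1 - with_lower_cap) * 100:.0f}% less)")
```

Voltage alone gives about a third less power, in the neighbourhood of the rounded figure in the post, and any capacitance reduction multiplies on top of that.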

Polaris gives us +15% clock with a 30-40% power reduction.

Even if a Zen core is like 1 whole XV module (and this is not the case, see below) in terms of transistors, we are on 14nm now and we have less than half the power, so 8 Zen cores at 14nm should draw less than 4 XV modules, namely 8 cores. And so the AMD math is justified...

Anyway, 1 Zen core has 4 decoders, 4 ALU, 2 AGU, 4 FPU, 512 KB L2.
1 XV module has 8 decoders, 4 ALU, 4 AGU, 4 FPU, 2048 KB L2.

I doubt that the uop cache and other enhancements occupy all that much...
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
The A12 9800 has 2 XV modules. The equivalent of Zen 8C would be 8 XV modules.

Doing some ballpark napkin math, if 2 XV modules use around 50W to work at frequencies higher than 4GHz and around 40W to work in a more efficient zone at just below 4GHz, then the 28nm equivalent of 8C Zen would use roughly 160-200W. If you take this rough estimate and shrink it by 30% to account for the 14nm transition, the XV equivalent of Zen would use 110-140W.

This is where 2 more factors come in:
1. Zen has more processing resources, hence uses more power per clock. I'll take KTE's numbers and add 30-40W more, although since it's an absolute value it's not clear to me whether this ballpark value was considered relative to 28nm or 14nm.
2. Zen offers significantly more performance per clock, hence can be clocked significantly lower and still come out on top performance-wise. This point is critical, because lowering frequency by 10-20% in a high performance product can easily drop power usage by as much as 20-40%.

So, time for even more dirty napkin math:
Best case scenario: (110W+30W)x0.6 = 84W
Worst case: (140W+40W)x0.8 = 144W

Average is 114W for a product with operating frequency somewhere between 3.2GHz and 3.6GHz.

Napkin math checks out, there's an operating frequency somewhere above 3GHz where our fictional product will hit 95W TDP.

However, as The Stilt and others mentioned several times before in this thread, it all comes down to the 14nm process used by AMD being able to clock significantly higher than 3GHz and still maintain its advantage over the best 28nm has to offer. The billion dollar question is where this advantage starts to fall off, and this is where Zen may or may not run out of gas.

Hold on, why would you need a 16C XV to match an 8C Zen? Even AMD compares core to core, and an XV module has 2 cores.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,907
3,517
136
Hold on, why would you need a 16C XV to match an 8C Zen? Even AMD compares core to core, and an XV module has 2 cores.
Outside of load and store, a CON core module has around the same amount of resources* as a Zen core.

*resources=
ALU
FPU
registers
retirement
instruction decode
branches
etc,etc
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Outside of load and store, a CON core module has around the same amount of resources* as a Zen core.

*resources=
ALU
FPU
registers
retirement
instruction decode
branches
etc,etc

But that doesn't mean it can be used. Hence the need for SMT.

Remember this from way back?


This is also why there can be such a massive power draw difference between two 100% CPU loads.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,907
3,517
136
But that doesn't mean it can be used. Hence the need for SMT.
Of course it can't (in a general purpose workload), but it still makes a comparison between 8 CON modules and 8 Zen cores a reasonable one. With the much improved cache, improved predictors and prefetchers, despite the big load/store deficit, a Zen core rivaling an XV module in throughput in a whole range of workloads is a real possibility.
 

coercitiv

Diamond Member
Jan 24, 2014
6,593
13,908
136
AMD stated that Zen has same energy for clock than XV core, at same clock.
Good for them, but I'll take the more cautious route and add those 30-40W anyway.

Hold on, why would you need a 16C XV to match an 8C Zen? Even AMD compares core to core, and an XV module has 2 cores.
Please show a minimum of respect for this conversation.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Of course it can't (in a general purpose workload), but it still makes a comparison between 8 CON modules and 8 Zen cores a reasonable one. With the much improved cache, improved predictors and prefetchers, despite the big load/store deficit, a Zen core rivaling an XV module in throughput in a whole range of workloads is a real possibility.

No it doesn't. And you lack benchmarks supporting your claim as well. Remember, AMD claims the performance benefit from a core-to-core comparison, not module to core.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,907
3,517
136
No it doesn't. And you lack benchmarks supporting your claim as well. Remember, AMD claims the performance benefit from a core-to-core comparison, not module to core.
And you lack the benchmarks to prove otherwise. What we do have is details around the uarch and resources. Once you start looking at instruction mixes (something I bet you have never done) you can get a pretty good idea about which workloads will see comparable scaling to a CON core module.

Also, it's pretty simple maths.
Have a look here:
http://www.anandtech.com/bench/product/1544?vs=1542

See all those benchmarks that get good SMT scaling (like ~50%)? In those workloads, with a 40% IPC improvement (Zen vs XV module), a Zen core will get around an XV module's throughput per clock.
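
One way to read the "simple maths" is sketched below in Python. It treats the 40% as a per-core IPC gain and the ~50% as the extra throughput from the second SMT thread, both taken from the post; how well a second core in an XV module scales is not stated anywhere in the thread, so `second_core_scaling` is a hypothetical parameter, set to perfect scaling as the most favourable case for the module.

```python
def zen_core_throughput(ipc_gain: float = 0.40, smt_scaling: float = 0.50) -> float:
    """Two-thread throughput of one Zen core, in units of one XV core's throughput."""
    return (1.0 + ipc_gain) * (1.0 + smt_scaling)


def xv_module_throughput(second_core_scaling: float = 1.0) -> float:
    """Throughput of one 2-core XV module, in the same units (scaling is assumed)."""
    return 1.0 + second_core_scaling


zen = zen_core_throughput()
module = xv_module_throughput()

print(f"Zen core, 2 threads: {zen:.2f}x an XV core")
print(f"XV module, 2 cores:  {module:.2f}x an XV core")
```

Under these assumptions the two land close together per clock, and only for workloads that actually get that ~50% SMT scaling.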
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
And you lack the benchmarks to prove otherwise. What we do have is details around the uarch and resources. Once you start looking at instruction mixes (something I bet you have never done) you can get a pretty good idea about which workloads will see comparable scaling to a CON core module.

Also, it's pretty simple maths.
Have a look here:
http://www.anandtech.com/bench/product/1544?vs=1542

See all those benchmarks that get good SMT scaling (like ~50%)? In those workloads, with a 40% IPC improvement (Zen vs XV module), a Zen core will get around an XV module's throughput per clock.

This comparison is better due to more equalized clocks.
http://www.anandtech.com/bench/product/1646?vs=1554

So we are already on the cherry picking part? Also remember, Zen is AMD's first attempt at SMT.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,907
3,517
136
This comparison is better due to more equalized clocks.
http://www.anandtech.com/bench/product/1646?vs=1554

So we are already on the cherry picking part? Also remember, Zen is AMD's first attempt at SMT.
No, we are on the "ShintaiDK can't read what I write" part......

Also, AMD is on its 5th uarch with threads that share resources (only load/store and integer execution weren't shared between threads in a CON module). They are also on their 7th uarch with physical register files. Go learn what a PRF is and it will be obvious why SMT isn't really a big deal for it.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Good for them, but I'll take the more cautious route and add those 30-40W anyway.


Please show a minimum of respect for this conversation.

Have you read the rest of my post?
I gave solid evidence that that was not an optimistic claim...
 