New Zen microarchitecture details

Abwx · Oct 12, 2016

KTE said:
I'm going to ignore the leaks for now. AMD is talking about K8 and being back at 20% TAM in the server world and winning the high performance DT market. They're saying Zen will be competitive top to bottom, price, performance and power. From what AMD is saying, explicit words, their tone and language, I'm expecting minimum 40% IPC increase rather than maximum or average and I am NOW expecting something past the IVB range of performance.

The information and details from AMD point to a 3.3-3.5GHz base at 95W TDP. The consumption given as equal to an EXV core must have been compared at certainly more than 0Hz, likely at the frequencies of Bristol Ridge, that is 3.5-4.2GHz; otherwise the comparison and the number wouldn't even make sense.

On paper Zen has more FP execution resources than its competitors. Looking at the INT resources, it has more, or at least as many, execution resources as well according to Hardware.fr, so throughput-wise it should be competitive (on a core-per-core basis, as mentioned by Papermaster) with Intel's current offering. The only unknown is what share of this total throughput comes from SMT.

Glo. · Oct 12, 2016

Abwx said:
The information and details from AMD point to a 3.3-3.5GHz base at 95W TDP. The consumption given as equal to an EXV core must have been compared at certainly more than 0Hz, likely at the frequencies of Bristol Ridge, that is 3.5-4.2GHz; otherwise the comparison and the number wouldn't even make sense.

On paper Zen has more FP execution resources than its competitors. Looking at the INT resources, it has more, or at least as many, execution resources as well according to Hardware.fr, so throughput-wise it should be competitive (on a core-per-core basis, as mentioned by Papermaster) with Intel's current offering. The only unknown is what share of this total throughput comes from SMT.

That would mean that a 95W CPU from AMD would be directly competitive with 140W 8-core CPUs from Intel.

That is very difficult to believe. 125W is more plausible, with 3.5 GHz and 8 cores.

If it really is 95W, it will be a genuinely huge engineering achievement.

krumme · Oct 12, 2016

And all that glory in only 4.9mm² incl. L2, if Mathias' predictions about size are correct. Less than half the size of the A10, or what? We don't need hype, and especially not people spending years referring to that hype, calling it crap and spreading negativity all over. Give it some realistic expectations, please, and enjoy it for what can be done.

lolfail9001 · Oct 12, 2016

Glo. said:
That would mean that a 95W CPU from AMD would be directly competitive with 140W 8-core CPUs from Intel.

That is very difficult to believe. 125W is more plausible, with 3.5 GHz and 8 cores.

If it really is 95W, it will be a genuinely huge engineering achievement.

Strictly speaking, outside of FPU torture tests the 6900K actually consumes under 100 watts even at 4GHz. Considering that AMD tends to give Typical Board Power nowadays, and that Zeppelin has a far humbler uncore than the Broadwell LCC die, 3.5GHz on all 8 cores is actually less believable so far than a 95W TDP on a 3.5GHz 8-core. So, with some small cheating regarding TDP to make Abwx happy, AMD are fully capable of delivering a 95W TDP 8-core with decent performance. Get ready for $600 pricing though.

Abwx · Oct 12, 2016

Glo. said:
That would mean that a 95W CPU from AMD would be directly competitive with 140W 8-core CPUs from Intel.

That is very difficult to believe. 125W is more plausible, with 3.5 GHz and 8 cores.

If it really is 95W, it will be a genuinely huge engineering achievement.

They said that it will ship at higher frequencies than the demonstrated 3GHz base. I don't think they would have mentioned "higher than 3GHz" if it was 3.1-3.2 that they had in mind; higher means at least 10%.

krumme said:
And all that glory in only 4.9mm² incl. L2, if Mathias' predictions about size are correct. Less than half the size of the A10, or what? We don't need hype, and especially not people spending years referring to that hype, calling it crap and spreading negativity all over. Give it some realistic expectations, please, and enjoy it for what can be done.

4.9mm² must be without L2, because the latter is 512kB...

Btw, there are about 102M transistors in an EXV module, and that's excluding the L2. A Zen core should be roughly 25% bigger, and at 25M transistors/mm² this amounts to effectively 5mm², which makes Mathias' estimate plausible, but not if it includes the L2, which I think it does not.
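As a quick check of the estimate above, here is a minimal sketch that uses only the figures from this post (about 102M transistors excluding L2, a Zen core roughly 25% bigger, 25M transistors/mm²). These are forum estimates rather than AMD data, and the function name is made up for illustration.

```python
def core_area_mm2(transistors_m, growth, density_m_per_mm2):
    """Area in mm^2 from a transistor count (in millions), a relative
    size increase, and a density (millions of transistors per mm^2)."""
    return transistors_m * (1.0 + growth) / density_m_per_mm2

# All three inputs are the estimates quoted in the post, not measured values.
zen_core = core_area_mm2(transistors_m=102, growth=0.25, density_m_per_mm2=25)
print(f"{zen_core:.1f} mm^2")
```

The result lands at about 5mm², in line with the 4.9mm² figure being discussed.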

krumme · Oct 12, 2016

Abwx said:
They said that it will ship at higher frequencies than the demonstrated 3GHz base. I don't think they would have mentioned "higher than 3GHz" if it was 3.1-3.2 that they had in mind; higher means at least 10%.

4.9mm² must be without L2, because the latter is 512kB...

Btw, there are about 102M transistors in an EXV module, and that's excluding the L2. A Zen core should be roughly 25% bigger, and at 25M transistors/mm² this amounts to effectively 5mm², which makes Mathias' estimate plausible, but not if it includes the L2, which I think it does not.

Nope. Incl l2.

http://digiworthy.com/2016/08/30/amd-zen-die-size-details-on-quad-core-unit/

This article is an estimate based on Mathias' work.

I don't know whether he is at 4.9 or 7mm² today. The point is: it's a damn small core.

Seems like a different segment to me.

krumme · Oct 12, 2016

And Mathias blog here

http://dresdenboy.blogspot.dk/2016/08/two-days-to-go-until-amds-hot-chips.html?m=1

KTE · Oct 12, 2016

bjt2 said:
This was an ARM presentation on 14nm FF, with an isolated NEON FPU of an A53 (or A57, I don't remember). Obviously the leakage is proportional to the number of transistors, and the isolated NEON FPU is at most 1/6 of a Zen CPU. Even if 18mW is negligible, we are talking about going from 30% of power wasted in leakage (100mW of 330mW) to 5% (18mW of 330mW)... On a 95W CPU this is from 30W wasted in leakage to 5W, with the 25W that can be invested in more clock...
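For reference, the percentages in the quote can be checked with a minimal sketch that uses only the quoted figures (100mW or 18mW of leakage out of 330mW total, then scaled to a 95W CPU). Scaling the fraction linearly to 95W is the quote's own simplification, not a claim about how a real Zen part behaves.

```python
def leakage_share(leak_mw, total_mw):
    """Fraction of total power lost to leakage."""
    return leak_mw / total_mw

TDP_W = 95  # CPU power budget used in the quote

for leak_mw in (100, 18):
    share = leakage_share(leak_mw, 330)
    # Apply the same fraction to the 95W budget, as the quote does.
    print(f"{leak_mw}mW of 330mW: {share:.1%} -> {share * TDP_W:.1f}W of {TDP_W}W")
```

The exact values are slightly off the rounded 30W/5W in the quote, but the order of magnitude of the freed-up power is the same.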

That's what I'm saying (if you read the other points). Gains at mW/low MHz/low IPC archs never translate directly to gains for high performance CPUs. Low power CPUs, at every generation in the past 6 years have made HUGE gains.

A53/A57 are ASICs with at least 30 FO4 per stage. This is the reason for the lower clock... Anyway, I found a graph that projected the consumption of this NEON FPU up to 4.3GHz, where it is about 1W. Even if Zen draws 10 times as much as this FPU and is built at 30 FO4, it should draw 10W/core at 4.3GHz...

They are in-order dual issue 8-stage ULTRA low power cores with max power around 0.8W though. They really are not comparable and neither does scaling occur as such (or Arm can just scale up and beat Intel today lol).

Frequency scaling also differs between the two in the same way.

And even then bjt2... While you're on A53... Have you seen how power more than DOUBLES from 200MHz to 400MHz for the A53? [from anandtech]

From 500MHz to 1GHz it more than TRIPLES.

After 900MHz, +100MHz increase (11%) makes a +135mW increase (33.7%)!

And when Arm increased perf 30-40% over the A7, they also increased power 2.22x!

Real world CPU constraints are very different to research papers

I was talking about medium IPC because I was assuming we start from a high IPC design and break the stages into more pieces, with the goal of increasing clock, losing some IPC to the longer latencies and thus longer branch misprediction penalties... But if the branch prediction is good and we add only 2.5 FO4 per stage, 17.5 FO4 per stage gives 15 FO4 for the logic, and maybe something useful can be done...
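The FO4 budget above can be written out as a small sketch. It assumes the cycle time is simply the per-stage FO4 count times a fixed FO4 delay, so the clock ratio between two designs on the same process and voltage is the inverse ratio of their FO4 counts. The 30 FO4 reference is the A53/A57 figure mentioned earlier in the post; the function names are illustrative only.

```python
def logic_fo4(stage_fo4, overhead_fo4):
    """FO4 delays left for useful logic after flip-flop overhead."""
    return stage_fo4 - overhead_fo4

def relative_clock(stage_fo4_old, stage_fo4_new):
    """Clock ratio new/old, assuming the same FO4 delay for both designs."""
    return stage_fo4_old / stage_fo4_new

print(logic_fo4(17.5, 2.5))                  # logic budget per stage
print(round(relative_clock(30, 17.5), 2))    # clock gain vs a 30 FO4 design
```

This ignores the IPC loss from the deeper pipeline, which is exactly the trade-off being argued about.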

That's the FO4 inverter delay for the brilliant Alpha 21264

It is possible, yes, but extremely difficult and requires a big budget, plenty of research time with good management.

Sent from HTC 10
(Opinions are own)

itsmydamnation · Oct 12, 2016

A57 isn't in-order and is triple issue with a very wide execution stage. In hindsight, A57 looks like ARM getting caught with their pants down (64-bit) and releasing what they had. A72 has been far better on power while improving performance.

edit: A17 is dual issue (I think; not sure if it's still in-order)

bjt2 · Oct 12, 2016

KTE said:
That's what I'm saying (if you read the other points). Gains at mW/low MHz/low IPC archs never translate directly to gains for high performance CPUs. Low power CPUs, at every generation in the past 6 years have made HUGE gains.

The Vcore was 0.8 and the frequency 2.41GHz; the leakage goes with Vcore and temperature. If at 0.9V on 28nm leakage is 30% of total power, how much is it at 1.35V (the Vcore of Excavator)? Obviously, in proportion it's probably not more than 50%, because the dynamic power of an Excavator CPU is higher... Why is it higher? Because the frequency is higher. But why is the frequency higher? Because the FO4 is lower... If with a high FO4 CPU they can get 5% leakage, imagine what the percentage is with a low FO4 design...

KTE said:
They are in-order dual issue 8-stage ULTRA low power cores with max power around 0.8W though. They really are not comparable and neither does scaling occur as such (or Arm can just scale up and beat Intel today lol).

Frequency scaling also differs between the two in the same way.

You forgot the differences in FO4. It's the FO4 that counts, not OoO or in-order. Surely in-order archs are simpler and so require fewer stages, but it's the FO4 that counts...

KTE said:
And even then bjt2... While you're on A53... Have you seen how power more than DOUBLES from 200MHz to 400MHz for the A53? [from anandtech]

From 500MHz to 1GHz it more than TRIPLES.

After 900MHz, +100MHz increase (11%) makes a +135mW increase (33.7%)!

And when Arm increased perf 30-40% over the A7, they also increased power 2.22x!

Are you talking about 28nm? Because on 14nm the frequency is much higher... Anyway, the 28nm NEON unit ran at slightly more than 1GHz at 0.9V with 330mW total consumption. I don't expect similar gains on a high clock CPU too (1.x->2.41GHz), but at least the same clock as 28nm, yes...

KTE said:
Real world CPU constraints are very different to research papers

That's the FO4 inverter delay for the brilliant Alpha 21264

It is possible, yes, but extremely difficult and requires a big budget, plenty of research time with good management.

2.5 FO4 was given in a PowerPoint posted by Dresdenboy as the delay of the flip-flops used by AMD... They also bought the bus from the 21264...

Anyway, if you open the papers that I linked before, the second also specifies that 2 FO4 flip-flops exist. It's a matter of energy efficiency: want faster FFs? More power. Indeed, they concluded that the best for perf/W was a 15 FO4 arch with 3 FO4 flip-flops...

Nothingness · Oct 13, 2016

itsmydamnation said:
edit: A17 is dual issue (I think; not sure if it's still in-order)

Cortex-A17 is dual issue, out of order.

hackroute · Oct 13, 2016

Does anyone know anything about the IOMMU (VT-d-like) specification/performance on Zen?

KTE · Oct 13, 2016

bjt2 said:
Anyway, if you open the papers that I linked before, the second also specifies that 2 FO4 flip-flops exist. It's a matter of energy efficiency: want faster FFs? More power. Indeed, they concluded that the best for perf/W was a 15 FO4 arch with 3 FO4 flip-flops...

I have already seen them/even already discussed one in this thread before. But I will reply with more time, very busy at work etc.

FO4 across archs deals with speed... Not power. Power is an arch/ISA thing.

Dresdenboy · Oct 13, 2016

KTE said:
I have already seen them/even already discussed one in this thread before. But I will reply with more time, very busy at work etc.

FO4 across archs deals with speed... Not power. Power is an arch/ISA thing.

Same here. Just to add: power efficiency became just another dimension in the design space. Lower FO4 delays per stage (than, say 30) and thus more stages (here it becomes a uarch metric) might give more clock gating opportunities.

bjt2 · Oct 13, 2016

A low FO4 implementation of an arch needs a lower Vcore for a target frequency. Lower Vcore means lower leakage.
A low FO4 implementation of an arch also lets you increase the frequency at the same Vcore. That means the same leakage but higher dynamic power, and if the IPC loss from going with low FO4/many stages is small, you increase performance, partially offsetting the increase in power.

Moreover, the same arch with lower FO4 means more stages and more flip-flops, which draw power.

So yes: for the same arch, changing FO4 also has implications for power. This is mentioned in both papers, where they use two different superlinear exponents for the increase in power that comes with the additional transistors required by the lower FO4.
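The "same Vcore, higher frequency" case can be put into a toy model. This is only a sketch under textbook assumptions that are not from the thread or the papers: dynamic power is taken as C·V²·f, leakage is held fixed at a given Vcore, and every number is an arbitrary unit chosen for illustration. It also ignores the extra flip-flop power mentioned above.

```python
def dynamic_power(c_eff, vcore, freq_ghz):
    """Textbook dynamic power model: C * V^2 * f (arbitrary units)."""
    return c_eff * vcore ** 2 * freq_ghz

def leakage_share(p_leak, p_dyn):
    """Leakage as a fraction of total (leakage + dynamic) power."""
    return p_leak / (p_leak + p_dyn)

# Hypothetical values, not measurements of any real CPU.
C_EFF = 10.0
VCORE = 1.0
P_LEAK = 5.0  # held constant because Vcore does not change

for freq in (3.0, 4.0):
    p_dyn = dynamic_power(C_EFF, VCORE, freq)
    print(freq, p_dyn, round(leakage_share(P_LEAK, p_dyn), 3))
```

In this model, raising the clock at fixed Vcore increases total power but shrinks the share lost to leakage, which is the direction of the argument in the post.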

ElFenix · Oct 13, 2016

KTE said:
Thanks ~~AMD~~ Obama

1234

KTE · Oct 13, 2016

^^Critical addition: with the same arch, yes

Cuz a 14nm 1V 30FO4 ARM A53 CPU is not the same power/frequency as Deneb at 30FO4 14nm...

Is it?

This correlation is what I'm pointing out to you as throwing things off.

bjt2 · Oct 13, 2016

You said that

KTE said:
FO4 across archs deals with speed... Not power. Power is an arch/ISA thing.

From this it seems you are saying that power is not influenced by FO4...
Power also depends on FO4, for the same arch, as I said. Obviously it ALSO depends on ISA/arch... It's a tautology: different archs/ISAs usually do not draw the same power...

KTE said:
^^Critical addition: with the same arch, yes

Cuz a 14nm 1V 30FO4 ARM A53 CPU is not the same power/frequency as Deneb at 30FO4 14nm...

Is it?

This correlation is what I'm pointing out to you as throwing things off.

Yes, as said above.

KTE · Oct 13, 2016

bjt2 said:
You said that

From this it seems you are saying that power is not influenced by FO4...

Not at all. That's basics. Anything that changes speed will affect power.

But FO4 correlations, and how exactly they will influence power at a given clock, become difficult across ISAs/archs.

Dresdenboy · Oct 13, 2016

krumme said:
Nope. Incl l2.

http://digiworthy.com/2016/08/30/amd-zen-die-size-details-on-quad-core-unit/

This article is an estimate based on Mathias' work.

I don't know whether he is at 4.9 or 7mm² today. The point is: it's a damn small core.

Seems like a different segment to me.

That estimation included the L2$. But now after Hot Chips I learned some more things, which would be enough to up the estimate by ~1mm² incl. L2$, making it ~6mm². The L2$ blocks seem to be quite large.

Well, now I took my stitched die shot and put the Hot Chips CCX schema on it:

I scaled the die shot to fit the CCX schema. Measuring 2 cores I get 69px * 136px. A single core is ~2.39% of the whole area then (which might still include some kerf). Including the L2$ (+some empty space) I get 97px * 136px, or 3.37% for a single core + L2$. At different Zeppelin die sizes, this makes:

Code:

160mm² estimated ZP die size: 3.8mm² core, 5.4mm² core+L2$
180mm² estimated ZP die size: 4.3mm² core, 6.1mm² core+L2$
200mm² estimated ZP die size: 4.8mm² core, 6.7mm² core+L2$
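The table above follows directly from the two area fractions measured on the die shot (2.39% for a core, 3.37% for core + L2$). A minimal sketch that reproduces it, using the same three estimated die sizes; the names are only illustrative.

```python
CORE_FRACTION = 0.0239      # single core, share of the whole die
CORE_L2_FRACTION = 0.0337   # single core + L2$, share of the whole die

def block_area_mm2(die_mm2, fraction):
    """Area of a block that occupies `fraction` of a die of `die_mm2`."""
    return die_mm2 * fraction

for die in (160, 180, 200):
    core = block_area_mm2(die, CORE_FRACTION)
    core_l2 = block_area_mm2(die, CORE_L2_FRACTION)
    print(f"{die}mm² estimated ZP die size: {core:.1f}mm² core, {core_l2:.1f}mm² core+L2$")
```

Any other die size estimate can be plugged in the same way, since both fractions are fixed by the pixel measurements.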

EDIT: This die's new adapted aspect ratio (was guessed before based on a perspective distorted wafer photo) visually fits even better a pair of Zeppelin dies on the data center APU:

jpiniero · Oct 13, 2016

Do we have any idea how big the Zen Server socket is?

Arachnotronic · Oct 13, 2016

jpiniero said:
Do we have any idea how big the Zen Server socket is?

Over 4K pins LGA according to a post from Stilt.

KTE · Oct 15, 2016

bjt2 said:
EDIT2: here http://www.eecs.harvard.edu/~dbrooks/micro2002-optpipeline.pdf is another paper that concludes with 15+3 FO4 (17.5 in the case of AMD, which has 2.5 FO4 overhead). This one specifically targets out-of-order superscalar architectures. The first targeted in-order architectures, with a mention that there is not much difference in the outcome and that in-order is simpler.

I have gathered maybe 30mins now

So these were the theoretical studies (based on Willamette/Alpha 21264 based simulators and SPEC2000) from the GHz race days. It was these that supported Intel's mentality... which culminated with Tejas. Guess where that ended up?

The processor world/paradigms have changed so much in those 20 years.

Have you seen a study based on anything AFTER P4 and 65nm concluding the same?

For today's landscape, there are just too many problems with that, some I've already discussed in the previous post. Infinite cache models and perfect front-ends are implicitly assumed. Static power/leakage problems at and after 65nm were totally unknown and unaccounted for with the linearly correlating PowerTimers. Vth used to scale back then, improving the clocks/power massively. Transistors on chip then vs now. They didn't even account for on-chip caches etc when it came to power. The advent of MT was again, unknown and unaccounted for. That study even starts with a low IPC arch and then scales.

The calcs even start with leakage power as 0.1 fraction compared to dynamic power !! Excellent research but outdated for today like I've previously said

NB: I work for one of them firms.

bjt2 · Oct 15, 2016

They model pipeline stalls, and Table 1 shows the differences between the theoretical model with infinite cache etc. and the real model with the stalls. Page 4, 4th paragraph, Table 1.

"Each latency in Table 1 has two values: the first labeled STD, is for our detailed simulation model, and the second labeled INF, assumes infinite I-Cache, I-TLB, DTLB, and a perfect front-end. The INF simulator model is used for validating the analytical model described in Section 3."

They cover both the theoretical case and the real case (a PowerPC architecture), the latter with a comprehensive simulation.
Have you even read the paper? I did, to the end...

Dresdenboy · Oct 16, 2016

Wilco at RWT forum posted an interesting link to a book covering some of the FO4 stuff - and provides a FO4 per cycle chart of lots of CPUs till 2011 (p. 105):
https://books.google.co.uk/books?id=PiWOAwAAQBAJ&pg=PA104&dq=Pentium+FO4+inverter+delays+per+pipeline+stage&hl=en&sa=X&ved=0ahUKEwitzKjppt_PAhUB1xoKHXJpBJkQ6AEILDAC#v=onepage&q=Pentium FO4 inverter delays per pipeline stage&f=false

The book is titled "Top-Down Digital VLSI Design: From Architectures to Gate-Level Circuits and FPGAs" by Hubert Kaeslin.

If you don't get the linked pages, a fresh search might bring you there. The whole book is rather interesting.

EDIT: The data used in the book was based on the CPU DB, which has been used by other authors, too. Here is a not so colorful diagram without distinction of the different companies as in the book:

It's part of this article (with paywall). Edit2: Full article
