New Zen microarchitecture details


Phynaz

Lifer
Mar 13, 2006
10,140
819
126
We are 7-8 months from ZEN release. By now AMD should be preparing to start production of ZEN, and you're saying they don't actually know how it's going to perform?

We haven't seen Kaby Lake either. Are you saying Intel doesn't know how it's actually going to perform?

Really man, you always have to talk negative on an AMD thread no matter what.

Microsoft didn't set final XBone clocks until weeks before shipping units. So yes, it's very possible they don't know how it's going to perform yet.

Let me point you to Bulldozer, do you think AMD knew how it was going to perform 7-8 months from release? You know, all the time they were saying it was going to be 40% faster than Clovertown or whatever.
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
It should be a significant improvement. I might put them side by side into a table together with some Intel uarchs, as IDC asked for. A short comparison:
Zen vs. XV as ST(MT for BD):
Decode: 4 + uOp$ vs. 4(4)
ALU: 2 vs. 2(4)
AGU: 2 vs. 2(4)
FMAC: 2 vs. 2
FMUL+FADD : 4 vs. 2
L1 D$: 32kB vs. 32kB (64kB)
L1 I$: 32kB? + uOp$ + more ITLBs vs. 96kB
Branch prediction: likely better in Zen, and checkpointing reduces effective branch misprediction penalty
L2$: 512k (faster) vs. 1MB (slow)
power efficiency (process normalized): likely goes up

ST will see a jump from wider uarch, while MT depends on power efficiency.

I can't thank you enough DB, for enriching an actually technically interesting discussion about an upcoming product!
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,952
3,634
136
Thank you for the updated diagram. So . . . is anyone worried that AMD may be limiting Zen's performance in certain applications by continuing to use 128-bit FMACs?
No, it's not the FPU width that is the issue, it's the load/store bandwidth, as there are 4 units in the FPU. It's one thing to be able to clock half the unit up and down like Intel does, but everything the data forwards through needs to be 256/512-bit wide (ZEN is 128/256).


Is Zen going to have to split 256-bit AVX/AVX2 instructions in hardware like XV?

Yes, but if it's not FMA then Zen has a latency advantage (at least over Bulldozer). If it's FMA then I have no idea, as exactly how FMA works on Zen is still unknown.

It should be a significant improvement. I might put them side by side into a table together with some Intel uarchs, as IDC asked for. A short comparison:
Zen vs. XV as ST(MT for BD):
Decode: 4 + uOp$ vs. 4(4)
ALU: 4 vs. 2(4)
AGU: 2 vs. 2(4)
FMAC: 2 vs. 2
FMUL+FADD : 4 vs. 2
L1 D$: 32kB vs. 32kB (64kB)
L1 I$: 32kB? + uOp$ + more ITLBs vs. 96kB
Branch prediction: likely better in Zen, and checkpointing reduces effective branch misprediction penalty
L2$: 512k (faster) vs. 1MB (slow)
power efficiency (process normalized): likely goes up

ST will see a jump from wider uarch, while MT depends on power efficiency.
looks like a typo here
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yes, but if it's not FMA then Zen has a latency advantage (at least over Bulldozer). If it's FMA then I have no idea, as exactly how FMA works on Zen is still unknown.
I looked into some more recent FPU related papers with novel ideas. That FP0+FP3|FP1+FP3 thing in the first patch had triggered me. Although the new GCC patch changed that to FP0|FP1 to give the compiler a simpler model of what's going on (it just has to optimize, not to know everything), there was a reason for involving FP3 in the first place. One of the simpler ones is to avoid needing 3 FPRF read ports per FMAC unit by providing the third operand (plus some preprocessing) via another unit. Since one FADD unit has 2 ports, it could use either of them to support one FMAC unit each.

looks like a typo here
Thanks. Corrected.
 

Boze

Senior member
Dec 20, 2004
634
14
91
To be honest, I'm not interested in a 95 watt part. And I'm getting more than a little sick of people who are. Since when did enthusiasts care about wattage?

I care about performance. Give me a 125W, 140W, 150W, or even a 160W processor with 6-8 cores at 3.4 GHz or higher and boosting to 4.0. Then I'll be impressed. If you can't afford to run 2-4 light bulbs' worth of electricity, you have bigger problems than a $200 processor and probably need to spend your $1000 to build a new system on a training program in your chosen field to make more money.

Sorry to do it, but like Paul Mooney, I gotta keep it real.

Let me see a top-end Zen that's 140W, 8 cores, 16 threads, 3.2-3.5 GHz with boost of 3.7-4.0 and matches an i7-5930K and I'll be impressed. If they manage to match the performance of Broadwell-E, I'll buy one just to give them support, because they'll have earned it.
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
To be honest, I'm not interested in a 95 watt part. And I'm getting more than a little sick of people who are. Since when did enthusiasts care about wattage?

I care about performance. Give me a 125W, 140W, 150W, or even a 160W processor with 6-8 cores at 3.4 GHz or higher and boosting to 4.0. Then I'll be impressed. If you can't afford to run 2-4 light bulbs' worth of electricity, you have bigger problems than a $200 processor and probably need to spend your $1000 to build a new system on a training program in your chosen field to make more money.

Sorry to do it, but like Paul Mooney, I gotta keep it real.

Let me see a top-end Zen that's 140W, 8 cores, 16 threads, 3.2-3.5 GHz with boost of 3.7-4.0 and matches an i7-5930K and I'll be impressed. If they manage to match the performance of Broadwell-E, I'll buy one just to give them support, because they'll have earned it.

So much this. You don't see people crying about the new Shelby GT's fuel economy (hint: it's awful); they just love the fact that its V8 sounds like a Ferrari and that it performs. Give me a 150W CPU and a 300W GPU for the rest of eternity, I couldn't care less.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
I care about performance. Give me a 125W, 140W, 150W, or even a 160W processor with 6-8 cores at 3.4 GHz or higher and boosting to 4.0.

Problem isn't reaching those TDP figures. The problem is that for a 50% power increase you might get 5% faster clocks. 5% IPC might be better for that but that messes up the entire CPU line, making it unsuitable for more profitable and bigger markets like power efficient mobile chips. So the only thing you gain really is faster clocks.
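To put rough numbers on that power-versus-clock tradeoff, here is a minimal sketch using the textbook dynamic power relation P ~ C * V^2 * f. The 20% voltage bump is a made-up assumption for illustration only, not a measured figure for any real chip, and leakage is ignored entirely.

```python
def relative_dynamic_power(freq_scale: float, voltage_scale: float) -> float:
    """Relative dynamic power under P ~ C * V^2 * f, with capacitance held fixed."""
    return voltage_scale ** 2 * freq_scale


# Hypothetical operating point: a 5% clock bump that needs 20% more voltage.
freq_scale = 1.05
voltage_scale = 1.20  # assumed for illustration, not taken from any datasheet

power_scale = relative_dynamic_power(freq_scale, voltage_scale)
print(f"clock +{(freq_scale - 1) * 100:.0f}%, power +{(power_scale - 1) * 100:.0f}%")
```

Under those assumed inputs the power increase lands around 50%, which is the kind of ratio described above. How much extra voltage a given clock bump really needs depends on the process and the design.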

The fact that all high performance CPUs have not reached beyond 4.5GHz for stock, and extreme LN2 overclocks in the 6-7GHz range tells us that there is a fundamental limit that won't be broken. The highest clock is set at 8.5GHz+ range with extreme unbalanced CPUs like Netburst based ones, and Bulldozer derivative FX chips.

And the bigger issue with higher TDP is dissipating it. What if you got a CPU that clocks 10% higher but needs a minimum of water cooling to reach acceptable temperatures? And increasing TDP just delays the inevitable. After 160W, people would clamor for 300W. Then after that 600W? 1200W? Intel was saying during the Netburst days that CPUs might reach the heat density of the surface of our Sun!
 

el etro

Golden Member
Jul 21, 2013
1,581
14
81
A 95W-limited Zen may disappoint enthusiasts, but at this point it is unlikely that AMD will be able to put out a 150W CPU so early. 95W will already be the hungriest processor ever made on 14LPP. AMD will have some more time to master the process, to get yields up, and then to deliver higher-clocked parts.

I don't believe FX9590 is the last processor of the Centurion series. Once design expertise (this one will get the most time), process maturity and yields at 14LPP reach ideal levels, they will put out an FX9000 processor again.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,952
3,634
136
This was the base for applying Gauss Copula models to the financial markets leading to the subprime mortgage crisis.

But as NTBMK said, it's good to have some salt in a discussion without new data points regarding clock frequency and power.

On top of this, as I have said numerous times before to the "Teh Bulldozer ZUXOR so will Zen" crowd, many of the components of Bulldozer appear top notch; it's the overall design target that was tragically wrong. They never give a rebuttal, just trash the next thread that comes along:
http://forums.anandtech.com/showpost.php?p=37929130&postcount=248
http://forums.anandtech.com/showthread.php?p=37825909&highlight=int#post37825909
http://forums.anandtech.com/showthread.php?p=37824147#post37824147

As can be seen from your initial work, the design target looks to focus on things that will make Zen a CPU the majority of the market would want. They will leverage what works well in CON and CAT and build anew what didn't work or won't scale.
 

deasd

Senior member
Dec 31, 2013
569
924
136
This FO4 number would tell a lot. For ARM and the 14nm/16nm implemented variants I have some estimations and statements (e.g. from Broadcom), allowing for some filling in the holes between designs at different processes. And although many parts of Zen will also be used in K12, there is not much of a relation between these two worlds.

Some timing constrained solutions described in patents could serve for any estimations. But there likely is only little ROI in finding that out with still a lot of error margin left. So far I think, Zen will use a typical value of 20-25 FO4 delays per stage. BD was 17 FO4 (12+5) design.

To estimate clock frequencies for given voltages, we also need to know, which standard cell libs they use, how their thresholds are for inserting LVT transistors, etc.

25 FO4 delay sounds a bit low, I think if the core is complex enough it might have much more delay.
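For anyone who wants to play with the FO4 numbers, here is a minimal sketch of how FO4 delays per stage map to a clock estimate: cycle time is FO4-per-stage times the FO4 delay of the process. The 10 ps FO4 delay below is a hypothetical placeholder, not a figure for 14LPP or any other real process, so only the relative ordering of the results means anything.

```python
def clock_ghz(fo4_per_stage: float, fo4_delay_ps: float) -> float:
    """Clock frequency in GHz when one cycle spans fo4_per_stage FO4 delays."""
    cycle_ps = fo4_per_stage * fo4_delay_ps
    return 1000.0 / cycle_ps  # a 1000 ps cycle corresponds to 1 GHz


FO4_DELAY_PS = 10.0  # hypothetical placeholder, not a real process figure

# 17 FO4 is the BD figure quoted above; 20 and 25 bracket the Zen guess.
for fo4 in (17, 20, 25):
    print(f"{fo4} FO4/stage -> {clock_ghz(fo4, FO4_DELAY_PS):.2f} GHz")
```

More FO4 per stage means a longer cycle and a lower clock ceiling at the same voltage, which is why the real FO4 number would tell a lot.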

L2$: 512k (faster) vs. 1MB (slow)

The L2 is small, wonder if L3 is still here in some low-end model. Zen APU might have room for L3 if the core+SMT is smaller than a Carrizo module.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Microsoft didn't set final XBone clocks until weeks before shipping units. So yes, it's very possible they don't know how it's going to perform yet.

I don't doubt that AMD is still figuring out what SKUs to release and what clock rates to use. But just because they're deciding (just as a hypothetical example) whether their flagship 8-core should have a base clock of 3.0 or 3.2 GHz, it doesn't mean that they don't already have a pretty clear idea of where it's going to land.

Let me point you to Bulldozer, do you think AMD knew how it was going to perform 7-8 months from release? You know, all the time they were saying it was going to be 40% faster than Clovertown or whatever.

Most of the Bulldozer hype was due to one man - John Fruehe. He spammed multiple boards with blatantly false statements about increased IPC, and did a tremendous amount of damage to AMD's reputation in the process.

The actual information released by engineers before release was much more modest - one official PowerPoint slide mentioned "knee-of-the-curve IPC and low gates/clock". Talk like that should have been a warning sign that IPC wasn't going to be competitive with Intel's designs. Some people didn't believe it, in part because of JF-AMD's lies, in part because the idea of AMD creating their own "Netburst" design after Intel had abandoned it as a failure seemed too stupid for words.

(Honestly, I'm surprised no one sued AMD over the lies. It seems to me that someone who bought an AM3 board in anticipation of Bulldozer, as quite a few forum-goers did, would have a cause of action if they were duped into doing so by JF-AMD's false statements.)

The difference with Zen is that we know AMD, this time, at least had internal awareness of what was wrong and what needed fixing. AMD's own executives said that Bulldozer was an "unmitigated failure". The 2015 Financial Analyst Day presentation devotes a whole slide to saying that IPC in Zen will be up 40% from Excavator, and also specifically mentions improving cache latency (which we know is a construction core weak point). And they brought in the people they needed, especially Jim Keller, to do it right this time.

Again: I don't expect miracles. I do expect a solid, competitive offering, about as competitive as Thuban was with contemporary Nehalem SKUs. My prediction is that IPC will be slightly above Sandy Bridge levels (but below Haswell) in most typical workloads. On clock speeds, I'd expect to see the top 8C/16T HEDT model with a base clock of 3.0-3.5 GHz, and the 6C/12T harvested die to have a base clock of 3.5-4.0 GHz.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I could be wrong but it seems to me Integer Throughput will be less per ZEN core vs Excavator Module.

That means a Quad Excavator Module (8 threads) may be faster than a Quad Core + HT (8 Threads) ZEN in some MT workloads (8x concurrent).

But I'm expecting each ZEN core to be smaller than an Excavator Module (node normalized)
 

deasd

Senior member
Dec 31, 2013
569
924
136
I could be wrong but it seems to me Integer Throughput will be less per ZEN core vs Excavator Module.

That means a Quad Excavator Module (8 threads) may be faster than a Quad Core + HT (8 Threads) ZEN in some MT workloads (8x concurrent).

But I'm expecting each ZEN core to be smaller than an Excavator Module (node normalized)

BD has more INT but fewer FP units, so I wouldn't be surprised if Zen has less INT or multithreaded performance than a whole EXV module, even if Zen has SMT. But in FP the situation might be different.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,952
3,634
136
BD has more INT but fewer FP units, so I wouldn't be surprised if Zen has less INT or multithreaded performance than a whole EXV module, even if Zen has SMT. But in FP the situation might be different.

BD has INT scheduling issues: it only had 2 ALUs, with branch, jump and mul on the same port.

That's why they moved instructions to the AGUs that every other AMD CPU from K7 onwards had on the ALUs. It also had very poor L2 latency, a high mispredict penalty, and other things like poor L1I associativity. All these things hurt single-thread performance, where you are racing to stall.

Given Zen's 4 ports, it's likely going to have a better spread (we don't know what every pipe can do yet, only what the ZEN patch tells us), and given its all-round better IPC target, I wouldn't assume EXV to have more throughput per module just yet.
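A back-of-the-envelope version of that comparison, as a sketch only: it counts peak ALU issue slots per cycle and nothing else. It assumes the 4 Zen ports are all ALU-capable and that an EXV module is two integer cores with 2 ALUs each, and it ignores scheduling, AGU offload, SMT and everything else that decides real throughput.

```python
def peak_alu_ops_per_cycle(int_cores: int, alus_per_core: int) -> int:
    """Upper bound on ALU ops issued per cycle; real throughput will be lower."""
    return int_cores * alus_per_core


# Assumed unit counts, taken from the figures discussed in this thread.
exv_module = peak_alu_ops_per_cycle(int_cores=2, alus_per_core=2)
zen_core = peak_alu_ops_per_cycle(int_cores=1, alus_per_core=4)

print(f"EXV module peak ALU ops/cycle: {exv_module}")
print(f"Zen core peak ALU ops/cycle:   {zen_core}")
```

On raw ALU count the two come out even under these assumptions, so whichever one wins in practice would be decided by how well the ports are kept fed, not by the unit count.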

We also have to see what AMD does on the SMT front; Intel's SMT isn't the only SMT :sneaky: . Dresdenboy has hinted at this a few times, so I guess he has found some interesting AMD patents or something.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
tatertot's not banned


tatertot?

Sweepr and I are referring to terrace215 from the XS forums, who got banned through a combination of AMD fans (mass reporting and an advocacy program), AMD employees (John Fruehe) and paying off some XS mods (as little as a couple of Opterons was enough) to protect the illusion of Bulldozer performance. At the end of the day, terrace215 was right and only got banned for PR reasons.

Anyone that dared to question Bulldozer's performance was called a shill, AMD hater, liar, troll or worse.

Something similar had "oddly enough" happened before with the Phenom release, which was going to beat Intel by 40-50%.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
@dresdenb - classic !
Whats your take of mm2 vs core?
I still think 4-5mm² isn't too far off the mark. The limiting factor isn't so much the amount of logic (as 14LPP would allow for ~70% area reduction vs. 28nm, but then come design styles, libs, etc.), but the W/mm².
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
4-5mm2 with 512KB L2 would be quite a bit smaller than the 7-8mm2 A9/A9X (without L2) and Broadwell/Skylake (with 256KB L2) cores on 14/16nm.
 

NTMBK

Lifer
Nov 14, 2011
10,328
5,383
136
4-5mm2 with 512KB L2 would be quite a bit smaller than the 7-8mm2 A9/A9X (without L2) and Broadwell/Skylake (with 256KB L2) cores on 14/16nm.

Given that it's not getting 256-bit datapaths and SIMD units, I would not be surprised if it is smaller than Broadwell/Skylake.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,712
1,241
136
My guess is in flux but...

I'm going around 10 mm² to 18 mm² per core, wide guess. This isn't a LP design, it's a HP design with wide DVFS. Which means Vt spacing, long gate lengths for everything, but sLVT transistors etc.
 