New Zen microarchitecture details

Dresdenboy · May 5, 2016

mikk said:
You don't understand my point. Without a core count advantage, CPU performance is all down to IPC and MHz, which means it's harder for AMD to compete with Intel. Whether a 4-core version is enough for mobile is a different question.

If customers only buy for running CB ST, then this is as important as you imply. What if a CCX uses less power at the same clock frequency (FPU differences, 256b core power, iterative multiplier, etc.) and can sustain high clocks for CPU+GPU?

Gundark · May 6, 2016

Anyone else think that Zen is a great name that the marketing team could do wonders with? If they fail to capitalize on this opportunity, they should be lined up against the wall.

nenforcer · May 6, 2016

Gundark said:
Anyone else think that Zen is a great name that the marketing team could do wonders with? If they fail to capitalize on this opportunity, they should be lined up against the wall.

That's just the development code name - I'd be surprised if they kept the AMD FX moniker but I've heard nothing to the contrary to say they won't.

VirtualLarry · May 6, 2016

I think that they should call them "Phenom X", to go along with "Fury X".

coercitiv · May 6, 2016

VirtualLarry said:
I think that they should call them "Phenom X", to go along with "Fury X".

Although Zen and Fury make up quite an interesting dichotomy.

Dresdenboy · May 6, 2016

coercitiv said:
Although Zen and Fury make up quite an interesting dichotomy.

+1

The Stilt · May 6, 2016

Seeing how strict the VRM tolerances used on Zen motherboards are makes me a bit worried about the process characteristics / the available headroom :hmm:

When AMD moved from 32nm SHP SOI to 28nm BULK, voltage stability became extremely important. Although platforms using parts made on different processes (e.g. AM3+ and FM2+) had exactly the same load-line specification (1.3mOhms), in reality the smaller and otherwise inferior 28nm process was significantly more sensitive to voltage variations / fluctuations.

Achieving a stable voltage supply through proper (load dependent) load-line calibration can result in hundreds of MHz additional headroom when close to Fmax, even on the more recent 28nm (Godavari) chips.

For Zen the load-line appears to be (based on the existing VRM designs) significantly tighter than it was with previous AMD designs and much tighter than the Intel VR12 spec (which is already strict) specifies...

Makes me wonder if Zen is either floored from the factory or is the process itself just extremely sensitive to voltage fluctuations.
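
To put rough numbers on the load-line point above: for an idealized resistive load-line, the voltage the CPU sees drops by load current times load-line resistance. The sketch below is only an illustration. The 1.3 mOhm figure is the one quoted in the post; the set voltage, the load currents, and the function names are hypothetical and are not measurements of any real board.

```python
# Idealized load-line model: V_cpu = V_set - I * R_loadline.
# Only the 1.3 mOhm figure comes from the thread; everything else is assumed.

AM3_FM2_LOAD_LINE_OHM = 1.3e-3  # load-line spec quoted for AM3+ / FM2+


def droop_mv(current_a: float, load_line_ohm: float) -> float:
    """Voltage droop in millivolts at a given load current."""
    return current_a * load_line_ohm * 1000.0


def loaded_voltage(v_set: float, current_a: float, load_line_ohm: float) -> float:
    """Voltage seen by the CPU under load for an ideal resistive load-line."""
    return v_set - current_a * load_line_ohm


if __name__ == "__main__":
    v_set = 1.40  # hypothetical no-load set voltage in volts
    for amps in (20, 60, 100):  # hypothetical load currents in amperes
        droop = droop_mv(amps, AM3_FM2_LOAD_LINE_OHM)
        v_cpu = loaded_voltage(v_set, amps, AM3_FM2_LOAD_LINE_OHM)
        print(f"{amps:>3} A: droop {droop:6.1f} mV, V_cpu {v_cpu:.3f} V")
```

In this model a smaller load-line resistance shrinks the voltage swing between light and heavy load, which is the kind of stability the post is talking about.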

Dresdenboy · May 6, 2016

The Stilt said:
For Zen the load-line appears to be (based on the existing VRM designs) significantly tighter than it was with previous AMD designs and much tighter than the Intel VR12 spec (which is already strict) specifies...

Makes me wonder if Zen is either floored from the factory or is the process itself just extremely sensitive to voltage fluctuations.

Does this fit a design with very low voltage margins? It could also mean a higher average clock frequency at a given TDP with many active cores.

What do A8/A8X voltage requirements look like?

The Stilt · May 6, 2016

Dresdenboy said:
Does this fit a design with very low voltage margins? It could also mean a higher average clock frequency at a given TDP with many active cores.

There are three possible reasons I can think of:

- Limited Vmax (to keep the voltage below certain threshold in low load / current draw conditions)
- Rapid & extremely frequent frequency (PState) switching, due to power management (like Carrizo at low TDPs). This would cause large changes in current draw and therefore larger voltage fluctuations, especially when certain states are restricted to n active cores while the others are available when all cores are utilized.
- A process characteristic when close to Fmax (like on 28nm BULK processes, compared to 32nm SHP SOI).

Impossible to say the real cause yet, but I would think it has something to do with the process characteristics. If it was a power optimization (second scenario) AMD probably would have used it on Carrizo too. Carrizo supports load-line adjustment through SW.

Flash831 · May 6, 2016

The Stilt said:
There are three possible reasons I can think of:

- Limited Vmax (to keep the voltage below certain threshold in low load / current draw conditions)
- Rapid & extremely frequent frequency (PState) switching, due to power management (like Carrizo at low TDPs). This would cause large changes in current draw and therefore larger voltage fluctuations, especially when certain states are restricted to n active cores while the others are available when all cores are utilized.
- A process characteristic when close to Fmax (like on 28nm BULK processes, compared to 32nm SHP SOI).

Impossible to say the real cause yet, but I would think it has something to do with the process characteristics. If it was a power optimization (second scenario) AMD probably would have used it on Carrizo too. Carrizo supports load-line adjustment through SW.

Aren't Zen motherboards = Bristol Ridge motherboards?
Could this be how AMD managed to raise the clocks over Carrizo?

The Stilt · May 6, 2016

Flash831 said:
Aren't Zen motherboards = Bristol Ridge motherboards?
Could this be how AMD managed to raise the clocks over Carrizo?

They are compatible yes.

Carrizo (by the original design) has several times higher load-line spec than FM2+ or AM3+, so Carrizo / Bristol Ridge compatibility is not the reason for it.

Bristol Ridge technically didn't raise the clocks compared to Carrizo. AMD just enabled higher TDP configurations on AM4 Bristol Ridge parts (up to 65W) and raised the clocks as high as possible (by blowing the voltages through the ceiling). That's also the reason why there won't be any unlocked Bristol Ridge SKUs even for AM4 (AFAIK), since there is nothing more to be squeezed out of it.

Abwx · May 6, 2016

Dresdenboy said:
Does this fit a design with very low voltage margins? It could also mean a higher average clock frequency at a given TDP with many active cores.

AM3+ was designed for up to 220W CPUs, so it's no wonder that it had lower voltage losses along the VRM-socket-CPU path, hence the (only apparent) lower voltage variation compared with 95W sockets...

For instance, a copper conductor with a 2.5mm2 cross-section and 10cm length has about 0.68 milliohms of resistance, to compare with the 0.0013R load-line resistance quoted by The Stilt.

As for voltage margin, it must be 10% at a minimum, which means that the circuit must work at 0.9x the nominal voltage.
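
The copper figure above can be checked with R = rho * L / A. This is a minimal sketch, assuming a copper resistivity of about 1.68e-8 ohm-metres near room temperature (a textbook value, not from the post); the function names and the 1.30 V nominal voltage in the usage lines are hypothetical.

```python
# Resistance of a uniform conductor: R = rho * L / A.
# The resistivity is an assumed textbook value for copper near 20 C.

COPPER_RESISTIVITY_OHM_M = 1.68e-8


def wire_resistance_ohm(length_m: float, area_mm2: float,
                        resistivity: float = COPPER_RESISTIVITY_OHM_M) -> float:
    """Resistance in ohms of a conductor with the given length and cross-section."""
    area_m2 = area_mm2 * 1e-6  # convert mm^2 to m^2
    return resistivity * length_m / area_m2


def minimum_operating_voltage(v_nominal: float, margin: float = 0.10) -> float:
    """Lowest voltage the circuit must tolerate for a given fractional margin."""
    return v_nominal * (1.0 - margin)


if __name__ == "__main__":
    r = wire_resistance_ohm(length_m=0.10, area_mm2=2.5)
    print(f"10 cm of 2.5 mm^2 copper: {r * 1000:.3f} mOhm")
    # 1.30 V is a hypothetical nominal voltage, used only to show the 0.9x rule.
    print(f"10% margin on 1.30 V: {minimum_operating_voltage(1.30):.3f} V")
```

With this resistivity the result comes out at roughly 0.67 mOhm, in line with the "about 0.68" quoted above and about half of the 1.3 mOhm load-line figure.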

Doom2pro · May 6, 2016

The Stilt said:
They are compatible yes.

Carrizo (by the original design) has several times higher load-line spec than FM2+ or AM3+, so Carrizo / Bristol Ridge compatibility is not the reason for it.

Bristol Ridge technically didn't raise the clocks compared to Carrizo. AMD just enabled higher TDP configurations on AM4 Bristol Ridge parts (up to 65W) and raised the clocks as high as possible (by blowing the voltages through the ceiling). That's also the reason why there won't be any unlocked Bristol Ridge SKUs even for AM4 (AFAIK), since there is nothing more to be squeezed out of it.

So you're saying Carrizo was basically running at its optimal clock and power, that it wasn't designed as a desktop part and then downclocked for less power but was designed for mobile, and basically Bristol Ridge is going to be overclocked mobile parts?

The Stilt · May 6, 2016

Doom2pro said:
basically Bristol Ridge is going to be overclocked mobile parts?

For the quoted part, yes. A desktop part wouldn't have eight PCI-E lanes available for dGPU, like Carrizo / Bristol Ridge does. In total Carrizo has 8x + 4x + 4x PCI-E lanes. The 8x is reserved for dGPU, 4x is GPP (general purpose) and 4x for UMI (FM2+ only, for external FCH communication).

Dresdenboy · May 7, 2016

Although made with a different process than Zen, the high clocks of Nvidia's 1070 and 1080 possibly show an interesting development: trade size for clocks while still being very power efficient. There seems to be no need to go wide @ low clocks for efficient design sweet spots. Lower mem PHY area at higher Gbps doesn't call for big dies, either.

For a final look at this we need die sizes of course.

itsmydamnation · May 7, 2016

Dresdenboy said:
Although made with a different process than Zen, the high clocks of Nvidia's 1070 and 1080 possibly show an interesting development: trade size for clocks while still being very power efficient. There seems to be no need to go wide @ low clocks for efficient design sweet spots. Lower mem PHY area at higher Gbps doesn't call for big dies, either.

For a final look at this we need die sizes of course.

how many stages do you think the Zen pipeline is? mid 20's? high teens?

Dresdenboy · May 7, 2016

itsmydamnation said:
how many stages do you think the Zen pipeline is? mid 20's? high teens?

TL;DR: High teens (at least).

Dresdenboy at SA said:
Code:

Architecture               Branch Misprediction Penalty
AMD K10                    12 cycles
AMD Bulldozer              20 cycles
Pentium 4 (NetBurst)       20 cycles
Core 2 (Conroe, Penryn)    15 cycles
Nehalem                    17 cycles
Sandy Bridge               14-17 cycles

Source: http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2

Zen's pipeline won't be as short as K10's. And it doesn't need to be, as improved branch prediction plus some (rumoured) additions like checkpointing (before taking a hard-to-predict branch), a µOp cache, and SMT (run the other thread at higher IPC in the meantime) reduce the perceived cost of mispredictions.

Some AMD patents show the cycles, for example US20140025933:
MAP, RDY, SCH, XRF, EX0, [EX1..], RE0, RE1, RE2 (Seronx, did you get the 3 retire stages from this patent?)
Before that there should be IT0, IT1/IC0, IT2/IC1-ICn, DEC0-DECm (parallel BP?) (see US20150121050).
That adds up quickly. If you compare those parts to Jaguar with a 14 cycle branch misprediction penalty, it looks to be at least as long for Zen, if not longer.

Of course, more stages could be required for the added units and increased complexity (wider schedulers, FMA support, checkpointing, SMT, etc.).

http://semiaccurate.com/forums/showpost.php?p=258949&postcount=2339

Nothingness · May 7, 2016

To complete the picture about branch misprediction:

Code:

Ivy Bridge 14 cycles
Haswell    14-15 cycles
Skylake    16-17 cycles

Source: http://7-cpu.com/
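
To see why a longer pipeline can be tolerated when prediction improves, the average cost of mispredictions per instruction can be estimated as branch fraction times misprediction rate times penalty. This is a rough sketch only: the penalties are the single-value entries from the tables above, while the 20% branch fraction and the misprediction rates are made-up illustrative numbers, and the function name is hypothetical.

```python
# Average misprediction cost per instruction:
#   extra CPI = branch_fraction * mispredict_rate * penalty_cycles
# Penalties come from the tables in the thread; the rates are assumptions.

PENALTY_CYCLES = {
    "AMD K10": 12,
    "AMD Bulldozer": 20,
    "Pentium 4 (NetBurst)": 20,
    "Core 2 (Conroe, Penryn)": 15,
    "Nehalem": 17,
    "Ivy Bridge": 14,
}


def mispredict_cpi(branch_fraction: float, mispredict_rate: float,
                   penalty_cycles: float) -> float:
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    branch_fraction = 0.20  # hypothetical: one in five instructions is a branch
    for rate in (0.05, 0.03):  # hypothetical misprediction rates
        print(f"misprediction rate {rate:.0%}")
        for name, penalty in PENALTY_CYCLES.items():
            cost = mispredict_cpi(branch_fraction, rate, penalty)
            print(f"  {name:<26} {cost:.3f} extra CPI")
```

Under these assumed rates, a 20-cycle penalty at a 3% misprediction rate costs the same 0.12 extra CPI as a 12-cycle penalty at 5%, which is the sense in which a better predictor offsets a deeper pipeline.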

JDG1980 · May 7, 2016

Dresdenboy said:
Although made with a different process than Zen, the high clocks of Nvidia's 1070 and 1080 possibly show an interesting development: trade size for clocks while still being very power efficient.

Yep. Looks like the Netburst/Bulldozer strategy (high clocks, low/moderate IPC) actually works on GPUs, while it has been a total bust on CPUs so far.

Part of the problem is that there seems to be a natural CPU "speed limit" of about 4.5 - 5.0 GHz. Toward the top end of that range, power usage shoots through the roof for smaller and smaller clock speed gains. If Piledriver could have hit 6.0 GHz at ~140W, it might actually have been competitive with Intel at the time. But that didn't happen.
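
A rough way to see the "power shoots through the roof" effect is the usual dynamic power approximation, P proportional to C * V^2 * f, combined with the fact that higher clocks need higher voltage. The sketch below is an illustration only: the voltage/frequency curve and all of its parameters are hypothetical and do not describe Piledriver or any real chip, and leakage is ignored.

```python
# Dynamic power approximation: P ~ C * V^2 * f (capacitance folded into the unit).
# The V/f curve below is a made-up assumption used only to show the trend.


def assumed_voltage(f_ghz: float, v_base: float = 1.00, f_base: float = 3.0,
                    volts_per_ghz: float = 0.15) -> float:
    """Hypothetical V/f curve: flat up to f_base, then rising linearly."""
    return v_base + volts_per_ghz * max(0.0, f_ghz - f_base)


def relative_dynamic_power(f_ghz: float) -> float:
    """Dynamic power in arbitrary units for the assumed V/f curve."""
    v = assumed_voltage(f_ghz)
    return v * v * f_ghz


if __name__ == "__main__":
    baseline = relative_dynamic_power(4.0)
    for f in (4.0, 4.5, 5.0, 5.5, 6.0):
        ratio = relative_dynamic_power(f) / baseline
        print(f"{f:.1f} GHz: clock x{f / 4.0:.2f}, dynamic power x{ratio:.2f}")
```

With this assumed curve, dynamic power grows clearly faster than the clock itself, so each additional step in frequency costs more power than the last.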

Dresdenboy · May 7, 2016

JDG1980 said:
Yep. Looks like the Netburst/Bulldozer strategy (high clocks, low/moderate IPC) actually works on GPUs, while it has been a total bust on CPUs so far.

IMO NW was OK, Prescott on the wrong process, and BD likely was the result of tradeoffs not favoring generic DT code.

IBM's CPUs, for example, show that a HF design isn't necessarily bad.

JDG1980 said:
Part of the problem is that there seems to be a natural CPU "speed limit" of about 4.5 - 5.0 GHz. Toward the top end of that range, power usage shoots through the roof for smaller and smaller clock speed gains. If Piledriver could have hit 6.0 GHz at ~140W, it might actually have been competitive with Intel at the time. But that didn't happen.

That could simply be a tradeoff to have a better design for many of the intended use cases.

HiroThreading · May 7, 2016

Dresdenboy said:
IMO NB was OK, Prescott on the wrong process, and BD likely was the results of tradeoffs not favoring generic DT code.

Do you mean Northwood was OK, or that NetBurst as an architecture was OK? Two very different statements.

Prescott was a disaster on 90nm, but even when ported to 65nm, Cedar Mill was pretty rubbish. Granted, Cedar Mill probably could have scaled to mid-4GHz speeds as a single core design. However, by that stage, the multicore era was unleashed and Intel never clocked Cedar Mill above 3.73GHz and mostly sold them as the MCM Pentium D (90-120W TDP).

Bulldozer was just a horrid architecture, much like NetBurst.

IBM's CPUs, for example, show that a HF design isn't necessarily bad.

Yes, but they also have 250-300W TDPs to play with, IIRC.

Dresdenboy · May 7, 2016

HiroThreading said:
Do you mean Northwood was OK, or that NetBurst as an architecture was OK? Two very different statements.

Prescott was a disaster on 90nm, but even when ported to 65nm, Cedar Mill was pretty rubbish. Granted, Cedar Mill probably could have scaled to mid-4GHz speeds as a single core design. However, by that stage, the multicore era was unleashed and Intel never clocked Cedar Mill above 3.73GHz and mostly sold them as the MCM Pentium D (90-120W TDP).

Bulldozer was just a horrid architecture, much like NetBurst.

Sorry, meant NW of course.

The Stilt · May 7, 2016

Netburst was horrible, but I don't think Intel was ever as much behind AMD in IPC as AMD is at the moment behind Intel, or was it? IIRC K7 had ~33% higher IPC in FP than Northwood.

nonameo · May 7, 2016

The Stilt said:
Netburst was horrible, but I don't think Intel was ever as much behind AMD in IPC as AMD is at the moment behind Intel, or was it? IIRC K7 had ~33% higher IPC in FP than Northwood.

One thing to keep in mind about P4 is that it was designed for high clocks. Northwoods of the time were performing on par (not everywhere, but enough to count) with their AMD counterparts.

SPBHM · May 7, 2016

The Stilt said:
Netburst was horrible, but I don't think Intel was ever as much behind AMD in IPC as AMD is at the moment behind Intel, or was it? IIRC K7 had ~33% higher IPC in FP than Northwood.

and Intel had an easy 1GHz advantage
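
The IPC versus clock trade-off in the last few posts comes down to performance being roughly IPC times clock. Here is a minimal sketch: the ~33% FP IPC lead and the roughly 1 GHz clock advantage are the figures recalled in the posts above, while the 2.2 GHz and 3.2 GHz clocks and the function names are hypothetical, chosen only to make the arithmetic concrete.

```python
# First-order model: performance ~ IPC * clock.
# The 1.33 IPC ratio and the 1 GHz gap come from the posts; the clocks are assumed.


def relative_performance(ipc: float, clock_ghz: float) -> float:
    """Throughput in arbitrary units (instructions per nanosecond)."""
    return ipc * clock_ghz


def break_even_clock(rival_clock_ghz: float, rival_ipc_ratio: float) -> float:
    """Clock needed to match a rival that has rival_ipc_ratio times your IPC."""
    return rival_clock_ghz * rival_ipc_ratio


if __name__ == "__main__":
    k7_clock = 2.2                   # hypothetical K7 clock in GHz
    northwood_clock = k7_clock + 1.0  # the "easy 1GHz advantage"
    ipc_ratio = 1.33                 # K7 FP IPC relative to Northwood, per the post

    needed = break_even_clock(k7_clock, ipc_ratio)
    print(f"Northwood needs {needed:.2f} GHz to match; it runs at {northwood_clock:.1f} GHz")
    print(f"K7:        {relative_performance(ipc_ratio, k7_clock):.2f}")
    print(f"Northwood: {relative_performance(1.0, northwood_clock):.2f}")
```

With these assumed clocks, the 1 GHz advantage more than covers a 33% IPC deficit, which matches the point that Northwood could perform on par despite lower IPC.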
