4x L1D per core (or 2x vs Excavator)
much lower L2 latency (also smaller L2)
wider execution resources
lower-latency FPU
much better L3
I don't disagree with you that there are said to be some major changes...
But the question is always: by how much?
Can we QUANTIFY this in numbers?
The design philosophy is one thing and the process another, and the two are separately controlled. Both have to be in sync to achieve the processor's design goals and a successful product at launch. Then there is the timing, which has to be right.
Do I think the Zen design is in the right direction? Yup. The process? It doesn't seem so. The timing? Nope.
L1/L2 latency hurt BD big time, along with poor branch misprediction rates on a deep pipeline, and then there was no uop cache or efficient BTBs to mask this. A small L1D and low cache associativity exacerbated these problems even more. Anything FP/SIMD ran into serious bottlenecks, especially when it required memory ops.
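To put rough numbers on why a deep pipeline plus a weak predictor hurts so much, here is a minimal back-of-envelope sketch. All figures are illustrative assumptions, not measured Bulldozer parameters:

```python
# Toy model of how misprediction rate and pipeline flush depth combine
# into an effective CPI penalty. Numbers are illustrative assumptions.

def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    # Each mispredicted branch flushes the pipeline and pays roughly
    # pipeline-depth cycles before useful work resumes.
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Shallow pipeline + good predictor vs deep pipeline + weak predictor
good = effective_cpi(1.0, 0.20, 0.03, 14)
bad  = effective_cpi(1.0, 0.20, 0.08, 20)
print(f"good: {good:.3f} CPI, bad: {bad:.3f} CPI")
# good: 1.084 CPI, bad: 1.320 CPI
```

The point is that the penalty term scales with both the mispredict rate and the flush depth, so a deeper pipeline magnifies every prediction miss.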
Wider execution only helps when your dispatch/issue, prefetch and fetch, schedulers, and all the pointers, trackers, and buffers in between are smart, big, and fast, with excellently tuned schemes and algorithms (including trace cache/fetch buffers), so that the EUs are not starved and stalls or structural hazards don't cause huge penalties; for example, a single stall downstream can leave the i-buffers upstream full while execution latencies stay high.
Small bottlenecks, like i-cache fetch being limited to selecting only one line per cycle, are the ones that tend to cause major performance penalties when your branch prediction isn't particularly efficient.
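A quick sketch of that fetch limit: if only one i-cache line can be selected per cycle, a taken branch ends the useful portion of the fetched line, so branchy code never gets near the line's instruction capacity. The line size and instruction size below are generic assumptions for illustration, not real BD parameters:

```python
# Toy model: fetch limited to one i-cache line per cycle means average
# fetch bandwidth is capped by the distance between taken branches.
# Line/instruction sizes are illustrative assumptions.

LINE_BYTES = 64
AVG_INST_BYTES = 4                            # assumed average x86 inst size
MAX_PER_LINE = LINE_BYTES // AVG_INST_BYTES   # 16 instructions per line

def fetch_bandwidth(insts_between_taken_branches):
    # Useful instructions fetched per cycle: whichever runs out first,
    # the cache line or the straight-line run of code.
    return min(MAX_PER_LINE, insts_between_taken_branches)

print(fetch_bandwidth(6))   # branchy code: capped at 6 per cycle
print(fetch_bandwidth(40))  # straight-line code: line-limited at 16
```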
Then you have the general improvements in prediction/prefetch/load-store.
Then there are also the knowns: stack cache, trace cache, checkpointing, etc.
These can add as little as 2-5% depending on what exactly is done and to what extent, or as much as 30-40%. Major unknowns, but we'll possibly have these details after Hot Chips 28 (Aug 21-23), like we did for BD back then.
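One way to see how individually small features reach that 30-40% range: independent per-feature speedups roughly multiply rather than add. The percentages here are purely illustrative, echoing the ranges above, not any leaked Zen numbers:

```python
# Rough compounding of independent per-feature speedups.
# Gains multiply: (1+g1)*(1+g2)*... - 1. Percentages are illustrative.

def combined_speedup(gains):
    total = 1.0
    for g in gains:
        total *= 1.0 + g
    return total - 1.0

low  = combined_speedup([0.02, 0.03, 0.05])   # a few modest tweaks
high = combined_speedup([0.10, 0.15, 0.12])   # a few major reworks
print(f"low: +{low:.1%}, high: +{high:.1%}")
# low: +10.3%, high: +41.7%
```

So three modest tweaks land near the bottom of the quoted range, while three major reworks compound past 40%.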
It is even more difficult to estimate the gain because the Zen design is not supposed to be a K7-based Thuban improvement, where we would know what to compare and estimate against, nor is it derived from BD; it is something completely different. Each of those had its own design-philosophy and process-based bottlenecks. A less-than-spectacular process, far behind parity, plus die size magnified each of their struggles. Who knows where the Zen design will bottleneck?
Just remember that BD also gained many theoretical improvements on paper in most departments (for instance, the huge BTBs, IBBs, the ROBs/PRFs and schedulers, the branch fusion, the overhauled branch predictors, 4-wide vs 3-wide decode, the repaired return stack, 1GB pages and a much bigger L1/L2 ITLB, loop detection, the predecode/pick buffers, RIP queues, etc.).