New Zen microarchitecture details

swilli89 · Aug 22, 2016

Arachnotronic said:
I think you're reading too much in my comment. This architecture reminds me of Haswell, that's all.

Sorry if I misunderstood then. Hopefully this IS reminiscent of Haswell; that would be an incredible start to Zen's life.

krumme · Aug 22, 2016

Dresdenboy said:
ComputerBase jumped the gun with HotChips slides:
https://www.computerbase.de/2016-08/amd-zen-architektur/

What's your current estimate of the die size of an 8-core incl. L3?

Dresdenboy · Aug 22, 2016

krumme said:
What's your current estimate of the die size of an 8-core incl. L3?

Between 160 and 200 mm².

With the latest slide giving a clear map of a CCX, I measured it to be 38 mm² and a core+L2 at ~5.5 - 6 mm² (one more than posted yesterday).
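A quick back-of-the-envelope check of those numbers, as a rough sketch only. It uses the figures from the post (38 mm² per CCX, 5.5 - 6 mm² per core+L2, 160 - 200 mm² for the whole die) and assumes an 8-core die is built from two CCXs of four cores each. The "remainder" values simply lump together everything that isn't accounted for (L3 and glue inside the CCX; memory controllers, I/O and fabric outside it) and are not AMD figures.

```python
# Figures taken from the post; everything derived from them is an estimate.
CCX_AREA_MM2 = 38.0            # measured from the slide
CORE_L2_AREA_MM2 = (5.5, 6.0)  # low/high estimate per core incl. L2
DIE_ESTIMATE_MM2 = (160.0, 200.0)

# Assumptions for this sketch (not stated in the post itself).
CORES_PER_CCX = 4
CCX_PER_DIE = 2


def ccx_non_core_area():
    """Area left in one CCX after subtracting the cores+L2 (L3, glue logic)."""
    low = CCX_AREA_MM2 - CORES_PER_CCX * CORE_L2_AREA_MM2[1]
    high = CCX_AREA_MM2 - CORES_PER_CCX * CORE_L2_AREA_MM2[0]
    return low, high


def die_non_ccx_area():
    """Area left on the die after subtracting both CCXs (uncore, I/O, ...)."""
    ccx_total = CCX_PER_DIE * CCX_AREA_MM2
    return DIE_ESTIMATE_MM2[0] - ccx_total, DIE_ESTIMATE_MM2[1] - ccx_total


if __name__ == "__main__":
    print("non-core area per CCX (mm^2):", ccx_non_core_area())
    print("non-CCX area on die (mm^2):", die_non_ccx_area())
```

Under these assumptions the two CCXs take 76 mm², so more than half of a 160 - 200 mm² die would be something other than cores and L3.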

superstition · Aug 22, 2016

Dresdenboy said:
Between 160 and 200 mm².

What's the limitation of the process?

Arachnotronic · Aug 22, 2016

superstition said:
Yeah, it's easy to see the word ripoff as a negative. It's actually a compliment.

What's the limitation of the process?

I said "riff on" not "rip off."

A link that you may find helpful:

One source of the word riff is the music world, in which it's not uncommon for a musician to take a tune s/he has heard, and then perform a little variation on that tune. It happens all the time in some jazz sessions.

To "riff on Hamlet," then, would be for a person to take a line from the Bard and play with it, explore it, have some fun with it, look at it in various ways, explore it for levels of meaning and possible connections to other concepts and ideas, and so on.

http://english.stackexchange.com/questions/225963/what-does-a-riff-on-shakespeare-mean

DrMrLordX · Aug 22, 2016

swilli89 said:
we all can admit Intel probably found the best optimized way to do things and AMD is simply adopting the same techniques.

Well, it's either that, or AMD decided that approaching the existing x86 market with something too different from Intel's offerings would probably bite them in the ass (again). People currently expect x86 CPUs to behave like Intel processors. There are too many circumstances where AMD CPUs from the last 5+ years "could have done better" but didn't thanks to stuff like FMA4, XOP, ST performance, blah blah blah.

superstition · Aug 22, 2016

Arachnotronic said:
I said "riff on" not "rip off."

You're right. My brain substituted a common expression for an uncommon one. I'll try to read more slowly.

DrMrLordX said:
but didn't thanks to stuff like FMA4, XOP, ST performance, blah blah blah.

The first two were due to AMD trying to adapt to Intel's changes concerning SSE5 specs, right?

KTE · Aug 23, 2016

I'm still curious if Zen, if primarily server centric, uses any localized interconnect buffering. That would really up the ante.

Sent from HTC 10
(Opinions are own)

Shamrock · Aug 23, 2016

I happened upon this article today (published Aug. 18th)

http://venturebeat.com/2016/08/18/amds-takes-biggest-jab-at-intel-in-years-with-zen-processor/

Some informative claims, and some we already know, but a few stuck out for me.

AMD says the “breakthrough performance” of Zen can challenge Intel’s fastest processor to date, the 10-core Broadwell-E processor.

It did mention the 3 GHz test with the Intel chip downclocked. So, is this claiming AMD will have more than 3 GHz?

AMD said that power would be competitive, the frequency we saw would be even higher at production than what we saw, and production of what we saw could be produced at scale.

KTE · Aug 24, 2016

^Even with equal IPC and equal power usage, AMD would need a base 4GHz Zen to challenge Intel's fastest in 2016.

Forget 2017.

Sent from HTC 10
(Opinions are own)

Ancalagon44 · Aug 24, 2016

I for one quite like what has been revealed about Zen so far. I would definitely consider buying one depending on the price/performance ratio compared to Intel processors next year.

My prediction would be that, in terms of IPC, it won't beat Haswell but it will come close, depending on the benchmark. In floating point heavy code, it will be further behind than in integer heavy code. This might mean that game performance won't get the major performance boost we are hoping for.

A lot depends on the clockspeeds that they can achieve. If they can only manage around a 3GHz base clock for the top SKU, they are in trouble. If they can hit 3.4GHz base and maybe 3.7GHz turbo boost for the top SKU, I think they will be in with a very good chance. If I were AMD, I'd probably rather shoot for higher clockspeeds even if it harms efficiency. At least, definitely on the desktop CPUs.

inf64 · Aug 24, 2016

AT posted part 2 on Zen:
http://www.anandtech.com/show/10591...t-2-extracting-instructionlevel-parallelism/4

hrga225 · Aug 24, 2016

Scheduler per pipe still baffles me. Is this their solution for handling SMT? Anyone have any idea?

Edit: Or the high level diagram is confusing me?

deasd · Aug 24, 2016

An interesting read from WCCF. He implies the SMT mechanism is much more like IBM Power than Intel Nehalem and its successors, that is, utilizing more resources when executing another thread. But I take it with a grain of salt:

http://wccftech.com/amd-zen-architecture-hot-chips/#comment-2855691209

We'll have to wait for benchmarks, but I'm growing ever more suspicious that Zen's SMT implementation is more like Power8's than anything Intel's produced so far. Intel's approach has been to allow a second thread to use unused CPU resources, but doesn't really over-provision those resources (a single thread can very nearly saturate the whole CPU). On Power8, they can scale up to 8 threads per core (Zen will only do 2), but they make that viable by doubling down on key CPU resources in the first place (Instruction Cache, rename registers, etc.). The end result is that the second SMT thread on Intel increases overall performance by around 15-25%, but on Power8 the second SMT thread can increase overall performance by around 60% in some workloads. In Layman's terms, Power8's 'hyperthreads' are more useful than Intel's.

AMD haven't talked about rename registers yet, but they have revealed that the instruction cache is 64KB per core; perhaps not-so-coincidentally, that's double the size of Skylake's instruction cache, and the same size as Power8's. The L1 Data cache is only 32K in all of these processors, but it's rather odd in processor design to have your instruction cache be twice the size of your L1 data cache -- unless you have a good reason. There are only two reasons I can think of -- either that second thread chews through a lot more instructions than in competing SMT designs, or possibly the uOp Cache can spill to L1. Looking at the slide from HotChips that shows which CPU resources are exclusive, competitively shared, or arithmetically arbitrated has me leaning toward the former, though they might not have overprovisioned CPU resources enough to match Power8 fully. There were also rumors months back about Zen doing some really novel things with SMT, which would seem to back that up.

The implication of that would be that Zen could run at a lower clockspeed than Intel's current Broadwell DE but still match in overall threaded performance (but perhaps giving up 10-15% single-threaded performance (not clock-normalized)). For the mainstream, they could release a quad-core CPU at similar clocks to Skylake, and outperform it in threaded workloads. In gaming workloads, since current consoles make 6-7 threads available to games, a quad-core Zen with 4 hyper-threads giving ~60% additional performance would give a lot better performance than a quad-core i7 with 4 hyper-threads giving ~20% additional performance. In fact, that Zen would have a throughput comparable to 6-7 dedicated cores.

We won't know until someone does an architecture deep-dive or we have benches showing SMT gains much larger than Intel's. But it's looking increasingly likely from what I see.
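For reference, the "6-7 dedicated cores" figure in the quoted comment falls out of simple multiplication. The sketch below just reproduces that arithmetic. The ~60% and ~20% SMT gains are the commenter's guesses, not measurements, and the function name and the assumption that every core gets the full SMT uplift are mine.

```python
def core_equivalents(physical_cores: int, smt_gain: float) -> float:
    """Throughput in 'single-thread core' units, assuming every core
    gets the same fractional uplift from its second SMT thread."""
    return physical_cores * (1.0 + smt_gain)


if __name__ == "__main__":
    # Figures from the quoted WCCF comment (speculative, not benchmarks).
    zen_like = core_equivalents(4, 0.60)    # hypothetical Power8-style SMT gain
    intel_like = core_equivalents(4, 0.20)  # typical Hyper-Threading gain per the comment
    print(f"quad-core, +60% SMT: ~{zen_like:.1f} core equivalents")
    print(f"quad-core, +20% SMT: ~{intel_like:.1f} core equivalents")
```

With those inputs a quad-core lands at roughly 6.4 versus 4.8 core equivalents, which is where the commenter's comparison comes from. It says nothing about whether Zen actually achieves such a gain.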

But I partly agree with him that an L1-inst twice as large as L1-data is unusual. This is very similar to Excavator, and even Bulldozer has a 64:16 ratio of L1 inst/data, with much more resources and even a dedicated pipeline for another 'thread'.

hrga225 · Aug 24, 2016

deasd said:
But I partly agree with him that an L1-inst twice as large as L1-data is unusual. This is very similar to Excavator, and even Bulldozer has a 64:16 ratio of L1 inst/data, with much more resources and even a dedicated pipeline for another 'thread'.

Yes, that part is a given, but the performance figures in the conclusion are a bit over the top. He forgets that Power is SMT4; of course firing up a 2nd thread gives a huge boost in performance.

Tuna-Fish · Aug 24, 2016

hrga225 said:
Scheduler per pipe still baffles me. Is this their solution for handling SMT? Anyone have any idea?

The architecture of the schedulers has no impact on SMT. At that point, the uops coming in are already effectively anonymous -- the schedulers don't know or care which thread they are from.

Having many 1-way schedulers has one big disadvantage -- each uop is assigned to scheduler on dispatch, and each scheduler can only execute one op per clock. Imagine a situation like:

1: add eax, ebx
2: add ecx, eax
3: add edx, eax

both 2 and 3 depend on the result of 1, but are independent of each other. If they all get assigned to a single scheduler, this piece of code takes 3 cycles to run, while a monolithic scheduler could run 2 and 3 simultaneously on different execution units. This disadvantage is not as bad as it sounds, as the simple solution of just spreading workload evenly across the queues is optimal most of the time. It does occasionally lose a cycle or two to scheduling losses, though.

The main advantage is that they are much easier to make larger, easier to clock higher and use less power. Intel Haswell has a 60-entry scheduler that is shared between all instruction types. Zen has 6x14 entries for integer, plus an unannounced separate queue for FP. The total Zen scheduler window is likely even bigger than the 97-instruction Skylake.
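The three-instruction example can be checked with a toy model. This is only a sketch under simplifying assumptions of my own: every uop has 1-cycle latency, each queue picks its oldest ready uop, and a "unified" scheduler is modelled as one queue that may issue several uops per cycle. It is not a description of how Zen or any Intel core actually picks uops.

```python
# uop id -> list of uop ids it depends on (the example from the post):
# 1: add eax, ebx   2: add ecx, eax   3: add edx, eax
DEPS = {1: [], 2: [1], 3: [1]}


def cycles_to_finish(deps, queue_of, issue_width_per_queue=1):
    """Cycle in which the last uop issues (1-cycle latency assumed).

    deps:     uop id -> producer uop ids
    queue_of: uop id -> scheduler queue id assigned at dispatch
    """
    issued_at = {}
    cycle = 0
    while len(issued_at) < len(deps):
        cycle += 1
        used = {}  # queue id -> uops issued from it this cycle
        for uop in sorted(deps):  # oldest first
            if uop in issued_at:
                continue
            # Ready only if every producer issued in an earlier cycle.
            ready = all(issued_at.get(d, cycle) < cycle for d in deps[uop])
            q = queue_of[uop]
            if ready and used.get(q, 0) < issue_width_per_queue:
                issued_at[uop] = cycle
                used[q] = used.get(q, 0) + 1
    return cycle


if __name__ == "__main__":
    same_queue = {1: 0, 2: 0, 3: 0}  # all three land in one 1-wide queue
    spread = {1: 0, 2: 1, 3: 2}      # dispatch spreads them across queues
    print("one 1-wide queue:  ", cycles_to_finish(DEPS, same_queue))
    print("spread over queues:", cycles_to_finish(DEPS, spread))
    print("unified, 4-wide:   ", cycles_to_finish(DEPS, same_queue, 4))
```

In this model the single 1-wide queue needs 3 cycles, while both the spread-out assignment and the unified scheduler need 2, matching the description above. As for window size, 6 x 14 gives 84 integer entries, so the FP queue would need more than 13 entries for the total to exceed Skylake's 97.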

majord · Aug 24, 2016

Got sick of sifting through so many slides, so attempted to compile into one and add some details..

Could do with more, and probably muffed something up, but i'm tired.. so bed time

hrga225 · Aug 24, 2016

Tuna-Fish said:
The architecture of the schedulers has no impact on SMT. At that point, the uops coming in are already effectively anonymous -- the schedulers don't know or care which thread they are from.

Having many 1-way schedulers has one big disadvantage -- each uop is assigned to scheduler on dispatch, and each scheduler can only execute one op per clock. Imagine a situation like:

1: add eax, ebx
2: add ecx, eax
3: add edx, eax

both 2 and 3 depend on the result of 1, but are independent of each other. If they all get assigned to a single scheduler, this piece of code takes 3 cycles to run, while a monolithic scheduler could run 2 and 3 simultaneously on different execution units. This disadvantage is not as bad as it sounds, as the simple solution of just spreading workload evenly across the queues is optimal most of the time. It does occasionally lose a cycle or two to scheduling losses, though.

The main advantage is that they are much easier to make larger, easier to clock higher and use less power. Intel Haswell has a 60-entry scheduler that is shared between all instruction types. Zen has 6x14 entries for integer, plus an unannounced separate queue for FP. The total Zen scheduler window is likely even bigger than the 97-instruction Skylake.

Thank you!

Dresdenboy · Aug 24, 2016

hrga225 said:
Scheduler per pipe still baffles me. Is this their solution for handling SMT? Anyone have any idea?

Edit: Or the high level diagram is confusing me?

The separate schedulers should be fine, as this simplifies the design, allows higher frequencies, and saves power. And integer/AGU ops usually have low latencies, so that each scheduler doesn't need to look at too many ops to fill a latency induced gap. FP instructions have longer latencies, thus the FPU has a unified scheduler.

hrga225 · Aug 24, 2016

Dresdenboy said:
The separate schedulers should be fine, as this simplifies the design, allows higher frequencies, and saves power. And integer/AGU ops usually have low latencies, so that each scheduler doesn't need to look at too many ops to fill a latency induced gap. FP instructions have longer latencies, thus the FPU has a unified scheduler.

Yes, I am aware of that now (with a little help from Tuna-Fish and you, and majord's diagram; I like your blog, btw). My confusion also came from mixing up the terminology: dispatcher vs. scheduler.

Ajay · Aug 24, 2016

KTE said:
^Even with equal IPC and equal power usage, AMD would need a base 4GHz Zen to challenge Intel's fastest in 2016.

Forget 2017.

Sent from HTC 10
(Opinions are own)

Yeah, 3.75 GHz assuming perfect scaling. I wish AMD really had a choice for Foundry. Zen looks like an excellent achievement by AMD with the potential for some solid wins if marketing does its job. But, having come this far, AMD really needs to be able to push uArch updates through quickly to maintain pace. That, and, GFL needs to ramp up its 14nm process for higher clocks as well. It's a lot to ask, but it sure would be nice to have a competitive choice in CPUs. Can't wait till 1Q17 to see real benchmarks!
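The 3.75 GHz figure can be reproduced as follows. This is a sketch that assumes the comparison is an 8-core Zen against a 10-core Intel part at a 3.0 GHz base clock, with equal IPC and throughput scaling perfectly with cores x clock. The function name is made up for illustration.

```python
def required_clock_ghz(rival_cores: int, rival_clock_ghz: float, own_cores: int) -> float:
    """Clock needed to match a rival's cores x clock product,
    assuming equal IPC and perfect multi-core scaling."""
    return rival_cores * rival_clock_ghz / own_cores


if __name__ == "__main__":
    # Assumed inputs: 10 cores at 3.0 GHz base vs. an 8-core Zen.
    print(required_clock_ghz(10, 3.0, 8), "GHz")
```

Real workloads rarely scale perfectly across cores, so this is best read as a rough break-even estimate for fully threaded throughput only.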

krumme · Aug 24, 2016

Now we have established some kind of agreement that a 95 W TDP, ~180 mm² CPU on a low-frequency process is not 2% faster than a 140 W TDP, 240 mm² die on a high-frequency process costing 1100 USD.

It would be a bit more interesting - to say the least - if we got some info about the efficiency.
The write-ups about it are very slim, imo, given its crucial importance.

Abwx · Aug 24, 2016

krumme said:
It would be a bit more interesting - to say the least - if we got some info about the efficiency.
The write-ups about it are very slim, imo, given its crucial importance.

Sous Blender, qui profite pleinement du multi-threading, Zen était légèrement devant tout en consommant un petit peu moins selon AMD

http://www.hardware.fr/news/14749/amd-dit-peu-plus-zen.html

Literally translated:

Under Blender, which benefits fully from multithreading, Zen was slightly ahead while consuming a little less, according to AMD.

Ajay · Aug 24, 2016

Yes, real benchmarks would be very useful. I don't expect to get them yet on ES silicon, but we could get lucky over the next couple of months.
Throughput, in terms of an apples-to-apples comparison running SPEC against an 8-core BW-E (or Xeon), is what we ultimately need. At release, we'll get all sorts of benches on various apps, games etc. So, the waiting continues.

AtenRa · Aug 24, 2016
