Thanks for your diagrams! Here is my attempt to reconcile the chip layout with my ideas about a "quad-tree" topology discussed in my CCX speculation thread (here), assuming each CPU chiplet consists of two 4-core CCXs, where each pair of chiplets forms a fully connected cluster of 4 CCXs (16 cores), and where these clusters are fully connected to each other.
There's close to no chance that there's enough room in each chiplet for 2 distinct core complexes if there's also 32MB of L3, and even with only 16MB of L3 you certainly aren't going to find the space for that much IO.
Perhaps that's why the chiplets forming each pair are mounted so close to each other? For some edge connection? Or the pair sits on top of an interposer which provides the interconnect?
I also agree that all of those IF links would require a small silicon interposer under each pair of chiplets. I don't see any interposers there, but then again there's too much glue to see anything. IFOP links (Infinity Fabric On-Package) have a power efficiency (PE) of ~2 pJ/b, while IFIS links (Infinity Fabric InterSocket) have a PE of ~11 pJ/b (source), or rather ~9 pJ/b (source). On-die links have a PE of ~0.1 pJ/b, and PCIe/DDR something like ~20 pJ/b. Technologies like silicon interposers or EMIB have a PE of under 1 pJ/b (source), so AMD should really look into those bridge chiplets in the future (and they probably already have).
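Just to put those pJ/b figures in perspective, here's a quick back-of-envelope calculation of per-link power. The 32 GB/s sustained bandwidth per link is my own assumption for illustration, not a confirmed figure:

```python
# Link power (W) = energy per bit (pJ/b) * bitrate (b/s) * 1e-12
BITS_PER_GBYTE = 8e9

def link_power_watts(pj_per_bit, gbytes_per_sec):
    """Power drawn by one link at a given sustained bandwidth."""
    return pj_per_bit * gbytes_per_sec * BITS_PER_GBYTE * 1e-12

bandwidth = 32  # GB/s per link -- assumed for illustration
for name, pe in [("on-die", 0.1), ("IFOP", 2), ("IFIS", 9), ("PCIe/DDR", 20)]:
    print(f"{name:9s} {pe:5.1f} pJ/b -> {link_power_watts(pe, bandwidth):5.2f} W")
```

At that bandwidth an IFOP link costs ~0.5 W while an IFIS link costs ~2.3 W, which is why keeping traffic on-package (or on a bridge/interposer at <1 pJ/b) matters so much once you multiply by the number of links.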
I have no idea what the underlying topology would be, either for the chiplets or for the I/O die, but they both have the same number of nodes (eight cores and eight chiplets). I'm interested to hear any ideas anyone might have, though.
Edit: Let me rephrase that. If each chiplet has two CCXs, each with four fully connected cores (crossbar topology), then what's the point of having 8 chiplets instead of 16 4-core chiplets? If there really are silicon interposers under those pairs of chiplets, then the four CCXs (on two chiplets) would be fully connected (a crossbar again) at a higher level, like Vattila has shown in the diagram. As long as there's enough room for all the microbumps in the chiplets (I think normal C4 bumps are much larger), this should be somewhat possible using a silicon interposer.
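A quick way to see why the node count matters for all this crossbar talk: fully connecting n nodes takes n(n-1)/2 point-to-point links, so the link count blows up fast. Purely illustrative:

```python
def crossbar_links(n):
    """Point-to-point links needed to fully connect n nodes."""
    return n * (n - 1) // 2

print(crossbar_links(4))   # 4 CCXs on one interposer pair: 6 links
print(crossbar_links(8))   # 8 chiplets fully connected: 28 links
print(crossbar_links(16))  # 16 hypothetical 4-core chiplets: 120 links
```

Fully connecting 4 CCXs locally on an interposer is cheap (6 links); trying the same trick globally across 8 or 16 nodes is where the wiring gets out of hand.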
The highest level is currently a bit weird, because some of the connections run through the I/O die (as I understand it) and some are directly between chiplets. If the interposers were active (containing logic and transistors), then maybe all four active interposers would again be fully connected to each other (a crossbar once more) at an even higher level. That would probably be a routing nightmare, even worse than Naples, because now the I/O die is also in the way. On top of that there would be additional IF links from each chiplet to the I/O die. I suspect the organic package in Rome only contains 8 links, one from each chiplet to the I/O die, with all the routing complexity hidden inside each node (either a chiplet or the I/O die). Choosing a good routing topology for more than 4 nodes seems to be quite a hard problem.
While Vattila's basic idea is good, there are a lot of problems with running that much wiring in the organic package that the Rome chiplets and I/O die sit on. Then again, there are a lot of problems with 8-core CCXs too.
And then there is always the possibility that they (AMD) have switched to some kind of ring bus topology, which might be fine. I think a mesh or something like the ButterDonut topology should probably be reserved for cases with more nodes and maybe active interposers.
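A rough sketch of why a ring is tolerable at 8 nodes but scales poorly: the average hop distance on a bidirectional ring grows roughly like n/4, while a crossbar is always 1 hop. These numbers are just illustrative, not anything AMD has published:

```python
def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    dists = [min(d, n - d) for d in range(1, n)]
    return sum(dists) / (n - 1)

for n in (4, 8, 16, 32):
    print(f"{n:2d} nodes: ring avg {ring_avg_hops(n):.2f} hops, crossbar 1 hop")
```

At 8 nodes a ring averages a bit over 2 hops, which is probably acceptable; at 32 nodes it's already over 8, which is why meshes and fancier topologies like ButterDonut start to make sense only with more nodes.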