64 core EPYC Rome （Zen2）Architecture Overview？

PeterScott · Nov 6, 2018

Atari2600 said:
That is not an explanation of the mechanics of why the latency will be worse beyond a high level assumption that going through an IO Controller will be markedly worse.

I have pointed out (with measured numbers) that your high level assumption does not hold up.

No you just countered with your own assumptions.

Bottom line.
A: Integrated memory Controller.
B: Off chip Memory controller (AKA Northrbridge) with cache to compensate.

You think B is Faster, I think A is faster. I am not going to waste my time arguing about it. But really I don't think everyone would have moved to IMC if it was slower.

As for why AMD won't design a desktop die, simples, cost. Any marginal (and they are very marginal) gains AMD would make in cutting DDR latency by a few percent for Ryzen7 would be lost in the mask cost and in requiring dedicated wafers for desktop rather than flexibility in wafer starts.

A redesign for Ryzen7 also means a redesign for Ryzen3. Which means 3 masks, not 1. Keeping the same 8 core unit allows the same 7nm mask to be reused and the I/O Controller switched out.

The market does not revolve around desktop, regardless of how much some folks on here get tunnel vision and think it does.

AMD is not that cash strapped that they can't afford to build a desktop die. They are stealing significant desktop market share back from intel and hope to get more for this new generation.

We are talking about 1 die. Intel uses at least 4 dies on desktop (2 core, 4, core, 6 core, and now 8 core). Resurgent AMD can't even afford one?

AMD could also just do an 8 core APU this generation to cover the market, so all the mainstream AM4 desktop/laptop parts could cover the market with selective component disabling on just one die.

Lets stick a pin in this one and see who's correct later. I think chiplets for TR/Epyc and monolithic for Mainstream (AM4). Time will tell who got that right.

Glo. · Nov 6, 2018

itsmydamnation said:
see what i want is 4x128 load and 2x128 store a cycle
what i think it will be is 2x256 and 1x256 store a cycle

What Ian wrote in the live steam could be interpreted both ways (to me anyway).

Are you sure it is not 4x256 Bit load, and 2x256 Bit store?

Tuna-Fish · Nov 6, 2018

Glo. said:
Are you sure it is not 4x256 Bit load, and 2x256 Bit store?

That would be 4x bandwidth, and they only marketed 2x.

In any case, there is little chance of more than 1x store. For the simple reason that the x86 memory model would make all the load units much more expensive if there were more than one store unit. (Every load unit must compare every load address with every in-flight store address at every store unit to maintain memory consistency.)

Atari2600 · Nov 6, 2018

PeterScott said:
No you just countered with your own assumptions.

I assume very little:

https://images.anandtech.com/doci/12625/AMD Ryzen Cache Clocks_575px.png

The latency numbers are there for you to see. DRAM latency time is an order of magnitude higher than L3. Therefore anything you do in terms of routing memory requests/returns that will operate in the same timeframe as the cache communications will do little material harm to DRAM access times.

PeterScott said:
You think B is Faster, I think A is faster. I am not going to waste my time arguing about it. But really I don't think everyone would have moved to IMC if it was slower.

Everyone moved to IMC when the alternative was going through the FSB!!

PeterScott said:
AMD is not that cash strapped that they can't afford to build a desktop die. They are stealing significant desktop market share back from intel and hope to get more for this new generation.

We are talking about 1 die.

We are talking about 2 dies, APU and mainstream desktop. Both on 7nm with expensive mask costs and with zero flexibility to move from what their original design is.

Or stick with 1 die, the 8C ccx. Then use different 12/14 nm IO designs to link them to other Chiplets/Nothing/GPU/Hybrid & the wider system.

-----------------------------------

Just thinking - is there much to stop AMD now releasing a 32 core APU for use in HPC? [if it would have any value is another matter given the 7nm Vega communicating over PCIe4 talked about at same event.]

Glo. · Nov 6, 2018

Rome: 8 Matisse Chiplets(7nm) + I/O Die(14 nm)
Matisse, desktop, AM4 platform: 1, or 2 Matisse Chiplets(7nm) + I/O die.
APUs: 1 Matisse Chiplet, I/O Die, Navi GPU Chiplet, HBM

Tell me this is not fu***** brilliant design?

inf64 · Nov 6, 2018

Glo. said:
Rome: 8 Matisse Chiplets(7nm) + I/O Die(14 nm)
Matisse, desktop, AM4 platform: 1, or 2 Matisse Chiplets(7nm) + I/O die.
APUs: 1 Matisse Chiplet, I/O Die, Navi GPU Chiplet, HBM

Tell me this is not fu***** brilliant design?

It's golden!

Gideon · Nov 6, 2018

Atari2600 said:
Or stick with 1 die, the 8C ccx. Then use different 12/14 nm IO designs to link them to other Chiplets/Nothing/GPU/Hybrid & the wider system.

Yeah, the IO Die is monstrous. I Can still see a cut down version of it easily being used for Threadripper. For AM4 there would have to be another one. And as they only need 1/4 of the IO, it would only make sense to add a GPU to that as well.

The possibilities are intriguing indeed. Maybe they could add a GPU for Threadripper 3? If for no other reason, then to speed up (Adobe) encoding tasks, that can also use parts of the GPU's dedicated encoding hardware.

EDIT:
Looking at this comparison, I Could Easily see AMD offering:

2 chiplets (16 cores) and
a smaller (1/4) I/O Die
(+ possibly even a small GPU)

as top-of-the line Ryzen On the AM4

That would of course require a memory controller that actually scales to 4+ GHz to actually utilize the cores

Gideon · Nov 6, 2018

Not needing an interposer is also a nice touch. Makes the possibility of seeing this on AM4 a lot more likely.

I also don't buy the "latencies will be worse when using chiplets" argument. Have you seen the CCX latency (in-die) on the Zen 1? The Retired Engineer™ already mentioned, that the latency hit from close MCM would be in single-digit ns. Now Ryzen has about 25-30ns worse memory latency than Coffee-Lake with similar ram. The I/O chip will almost certainly have L4 cache and it doesn't have to run interconnects at the memory clock. My bet is, that if anything, the latencies will be considerably improved

mattiasnyc · Nov 6, 2018

Gideon said:
Yeah, the IO Die is monstrous. I Can still see a cut down version of it easily being used for Threadripper. For AM4 there would have to be another one.

See, I think the most financially beneficial thing to do for AMD is probably the path of least resistance so to speak. Rather than cutting those i/o dies down all that much I'd think they'd prefer to slightly modify them to simply kill off what isn't needed, but retain the size and overall design. I don't know jack about this really but it seems to me that this is what led AMD to be able to create one die for server, hedt, desktop and apu. It's pretty brilliant. Why not do essentially the same for the i/o.

Gideon said:
And as they only need 1/4 of the IO, it would only make sense to add a GPU to that as well.

The way I see it adding that GPU might be more trouble than it's worth for the above reason.

Additionally I'd think that those who are in the market for Threadripper actually won't care that much about integrated GPUs. Certainly at that price I'd expect video editors and colorists to simply buy a dedicated GPU fro processing + a dedicated card for video output if necessary. I could be wrong about that of course.

Gideon · Nov 6, 2018

mattiasnyc said:
Additionally I'd think that those who are in the market for Threadripper actually won't care that much about integrated GPUs. Certainly at that price I'd expect video editors and colorists to simply buy a dedicated GPU fro processing + a dedicated card for video output if necessary. I could be wrong about that of course.

This is the article I was mentioning. The issue isn't the lack of a discreet GPU. Adobe Premier can use the IGP on top of CUDA (a 1080 Ti was used in GamerNexus test). It seems that some parts of the Intel HW encoder pipeline are used (not all of it, as there is no quality degragation!) or just the nearer IGP for some really latency-sensitive stuff (travelling through PCI-e will take quite some time). Whatever it is, it improves the 8700K performance significantly in a workload that Youtubers really do use.

CrazyElf · Nov 6, 2018

Manabu said:
If the reticle limit for the passive interposer is 800mm², the SC die could be up to 25x25=625mm². If we can exclude the borders, then each passive interposer would cost about $10 from a $600 300mm wafer. If not, then about $16 per passive interposer. Compared to the $60~$100 for the 625mm² 14nm SC plus ~$100 for the 8x64mm² 7nm chiplets, it isn't much. For a 300mm² 14nm SC, the SC would be $35, the full passive interposer $8 and the chiplets would continue to be $100 for 64 core.

Are you sure 800mm^2 is the limit?

https://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/3

Finally, as large as the Fiji GPU is, the silicon interposer it sits on is even larger. The interposer measures 1011mm2, nearly twice the size of Fiji. Since Fiji and its HBM stacks need to fit on top of it, the interposer must be very large to do its job, and in the process it pushes its own limits. The actual interposer die is believed to exceed the reticle limit of the 65nm process AMD is using to have it built, and as a result the interposer is carefully constructed so that only the areas that need connectivity receive metal layers. This allows AMD to put down such a large interposer without actually needing a fab capable of reaching such a large reticle limit.

What’s interesting from a design perspective is that the interposer and everything on it is essentially the heart and soul of the GPU. There is plenty of power regulation circuitry on the organic package and even more on the board itself, but within the 1011mm2 floorplan of the interposer, all of Fiji’s logic and memory is located. By mobile standards it’s very nearly an SoC in and of itself; it needs little more than external power and I/O to operate.

DisEnchantment · Nov 6, 2018

Regarding IF 2.0 with Zen 2
- The width is now doubled to 64 bits for inter die
- Data compression where possible is applied to address (see 20180052631 METHOD AND APPARATUS FOR COMPRESSING ADDRESSES )
- Data compression from die to MC (see 20180246657 - DATA COMPRESSION WITH INLINE COMPRESSION METADATA )
- If data is less than bus width 'holes' are present instead of line toggle to save power. (see 20180314655 - POWER-ORIENTED BUS ENCODING FOR DATA TRANSMISSION)
- Data compression inter processor (see 20180167082 - COMPRESSION OF FREQUENT DATA VALUES ACROSS NARROW LINKS)
Inter die is not only to another CCX but to IO Chiplet apparently according to the patents.

Lots of the interconnect/IF related patents from AMD are under made under DoE contract (FastForward-2 Memory Technology) from 2014 of 32Million USD

https://www.amd.com/en/press-releases/extreme-scale-hpc-2014nov14

Since the patents are fairly close to what AMD was implementing in Zen 2, I thought it might be worth to have a look what the engineers were thinking

Shivansps · Nov 6, 2018

What they are making makes sence for server, they already hit the wall hard with Epyc, no way to keep adding more cores to that.

But they are already des-integrating the NB and MC? Its this like going back to pre-Athlon64? Not sure if i want to see this in desktop.

inf64 · Nov 6, 2018

Shivansps said:
What they are making makes sence for server, they already hit the wall hard with Epyc, no way to keep adding more cores to that.

But they are already des-integrating the NB and MC? Its this like going back to pre-Athlon64? Not sure if i want to see this in desktop.

How about we wait and see how it performs on desktop? Or you think AMD employs subpar engineers? This is cream of the crop design effort and this Zen iteration will be the cash cow AMD has been waiting for.

HurleyBird · Nov 6, 2018

So 8 cores and 32MB of L3 on one of those chiplets actually seems like a really tight fit. Not much room at all for anything else, and certainly not enough room for 2 distinct CCXes, at least as we know them. The best I can come up with (quite rough) is:

Of course, 32MB of L3 isn't quite confirmed yet, even if a few sources seem pretty confident about it.

Manabu · Nov 6, 2018

It seems I was wrong on betting on pasive interposers this time. From the photos, it seems the same MCM with IFOP between the dies. Points to kokhua. But nobody here guessed the actual disposition of the chiplets around the IO die.

Now I believe Matisse might be indeed a single 7nm die.

CrazyElf said:
Are you sure 800mm^2 is the limit?

https://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/3

No, I just picked an approximated number from my head. But thanks, this validades what I said.

By what you are quoting, the reticle limit for that particular process should be somewhere bellow 1011mm², and they use a technique similar to what I described in the paragraph above what you quoted me to make an interposer bigger than reticle limit w/o using stitching, right?

Zapetu · Nov 6, 2018

I have to say that after watching the event live I'm even more confused than before. It was great that the chiplet design was confirmed although without any new packaging technologies (like bridge chiplets (~EMIB) or even a passive silicon interposer). Still using normal MCM packaging is probably the safest bet and we still don't know how the chiplets are connected to the main I/O die (latencies, power draw of the interconnect).

I made some more diagrams based on todays AMD New Horizon event and here's the most basic one of the AMD Rome:

Glue is not included. I did use an actual perspective corrected picture as a base and everyting should be about right. However I will take no responsibility if some of the information in the next picture is a little off:

I did't check if someone had already calculated the die sizes but those are my initial approximations.Just for reference i'm also going to include once more a picture of the AMD Naples:

If there are some major errors in my die size estimations I will correct them later. Please feel free to use any of the images and draw over them to better illustrate how everything is connected together.

I think that at some point kokhua predicted that the chiplet size would be about 74mm2 (based on double the L3 and 2x scaling), so he was pretty much exactly right about that. The I/O die is actually really large but all the EPYC products need all the memory channels and PCIe lanes. Maybe the partly defective ones will be saved for TR3s. It's hard to say anything about Ryzen3 at this point but chiplets obviously can be used for pretty much anything if the latencies are good enough.

I had some vague idea how the placement of the chiplets is based on the thermal properties of the whole package but otherwise I was pretty wrong about almost everything that I came up with myself which is fine. AdoredTV was right all along and maybe in the future we willl see some more advanced packaging methods including atleast some bridge chiplets. If Rome would have used a passive silicon interposer, stitching would have probably been required. I think that AMD plans to sell a lot of these things and therefore manufacturing costs really matter.

HurleyBird · Nov 6, 2018

Zapetu said:
I think that AMD plans to sell a lot of these things and therefore manufacturing costs really matter.

More specifically, capacity really matters given that these are going to sell way above cost in any case.

Arachnotronic · Nov 6, 2018

This is some really good stuff from AMD. Hopefully the core has some nice IPC improvements + clocks well, could very well be at the heart of my next gaming system.

kokhua · Nov 6, 2018

Zapetu said:
If there are some major errors in my die size estimations I will correct them later. Please feel free to use any of the images and draw over them to better illustrate how everything is connected together.

I got 420 mm^2 for the I/O and 72 for the CCX. Close enough. The edges are hard to discern.

May I use your pictures in case I need them for illustration in future?

Glo. · Nov 6, 2018

Estimated increase in IPC is based on AMD internal testing for “Zen 2” across microbenchmarks, measured at 4.53 IPC for DKERN +RSA compared to prior “Zen 1” generation CPU (measured at 3.5 IPC for DKERN + RSA) using combined floating point and integer benchmarks.

Wait? 28% IPC increase? Am I reading this correct?

JDG1980 · Nov 6, 2018

The I/O die is well over 400mm^2? Even on 14nm, that's far larger than I would have expected. At that size, I have to wonder if it includes an iGPU.

HurleyBird · Nov 6, 2018

29% but likely a best case scenario.

HurleyBird · Nov 6, 2018

JDG1980 said:
The I/O die is well over 400mm^2? Even on 14nm, that's far larger than I would have expected. At that size, I have to wonder if it includes an iGPU.

That would really be a great way to combat the entrenchment of CUDA.

kokhua · Nov 6, 2018

JDG1980 said:
The I/O die is well over 400mm^2? Even on 14nm, that's far larger than I would have expected. At that size, I have to wonder if it includes an iGPU.

I estimated 250-300 mm^2 without L4. The actual die size suggest maybe there will be an L4 after all. But perhaps not a big one. Say 128-256MB. We can hope.

64 core EPYC Rome （Zen2）Architecture Overview？

Platinum Member

Diamond Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Platinum Member

Platinum Member

Senior member

Platinum Member

Member

Golden Member

Diamond Member

Diamond Member

Platinum Member

Junior Member

Member

Platinum Member

Lifer

Member

Diamond Member

Golden Member

Platinum Member

Platinum Member

Member