64 core EPYC Rome (Zen2) Architecture Overview?


Vattila

Senior member
Oct 22, 2004
805
1,394
136
I am still waiting for someone to offer an alternative architecture that explains why AMD would move to 9 dies instead of just staying with 4 like in Naples.

Just for fun: On a small 7nm square die, arrange 4 x 4-core CCXs in a fully-connected topology (6 links). That gives you a 16-core building block, a super-CCX with no uncore logic. Mount that super-CCX die on top of a slightly bigger 12nm die containing all the uncore logic (IO, security processor, memory controllers, etc.). Now you have a 16-core CPU. For 64-core "Rome", mount 4 of those on a passive 28nm interposer in a fully-connected topology.

That gives you a 9-die CPU, albeit in a very implausible way.
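
Sanity-checking the link and die counts of that thought experiment (trivial arithmetic, nothing more):

[CODE]
# Links needed to fully connect n units: n * (n - 1) / 2
def full_mesh_links(n):
    return n * (n - 1) // 2

print(full_mesh_links(4))  # 4 CCXs inside the super-CCX -> 6 links
print(full_mesh_links(4))  # 4 super-CCX stacks on the interposer -> 6 links

# Die tally for the whole package:
# 4 x (7nm core die + 12nm uncore die) + 1 passive interposer = 9 dies
print(4 * 2 + 1)
[/CODE]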
 
Last edited:

kokhua

Member
Sep 27, 2018
86
47
91
Just for fun: On a small 7nm square die, arrange 4 x 4-core CCXs in a fully-connected topology (6 links). That gives you a 16-core building block, a super-CCX with no uncore logic. Mount that super-CCX die on top of a slightly bigger 12nm die containing all the uncore logic (IO, security processor, memory controllers, etc.). Now you have a 16-core CPU. For 64-core "Rome", mount 4 of those on a passive 28nm interposer in a fully-connected topology (6 links).

That gives you a 9-die CPU, albeit in a very implausible way.

I’ve toyed with many ideas over the last couple of months. Dig deeper and you are sure to uncover some technical implausibilities.

The biggest “breakthrough idea” for me is realizing that the “uncore” can be much more sophisticated than just dumb I/O. Lay the dies out and it looks like the CPU dies are the dumb ones, like memory chips.
 
Reactions: DownTheSky

Vattila

Senior member
Oct 22, 2004
805
1,394
136
I’ve toyed with many ideas over the last couple of months.

I am curious. Can you share a brief description of those, as well as your reasons for dismissing them?

And, by the way, I take it that you are pretty fixed in your belief that "Rome" will have 9 dies. However, if you allow some doubt about that rumour, are there other designs that you would prefer? It seems you had to work very hard to come up with something that made sense to you with 9 dies.
 
Last edited:

kokhua

Member
Sep 27, 2018
86
47
91
I am curious. Can you share a brief description of those, as well as your reasons for dismissing them?

Not dismissing it; it was just a casual comment. I didn’t actually think about it more deeply, and it sounded like you didn’t either. It seems to me more like a packaging method than an architectural change, though.

And, by the way, I take it that you are pretty fixed in your belief that "Rome" will have 9 dies.

Yes. I no longer have any doubt that ROME will be 9 dies. Partly because multiple credible sources say the same thing, but more because it makes complete sense now that I am able to unravel the conundrum.

if you allow some doubt about that rumour, are there other designs that you would prefer?

Actually, no. I think this architecture is very neat and flexible. Almost perfect, by my own reckoning. For example, you can configure various SKUs for EPYC and TR by using more or fewer CPU dies as necessary. No artificial de-featuring or need for salvaging dies. None of the concerns about core/cache ratio, memory channels and I/O availability. None of the duplicated blocks wasting die area, as in Naples. Very simple “wiring” at the package level. Also, none of the NUMA issues that caused Naples to underperform Xeon in some workloads. It supports 2P and 4P configurations with only 1 hop from any requester to any responder, compared with 2P only and 2 hops for Naples.

I especially like the idea of a large L4 cache (or L3, if you remove that from the CCX and enlarge the L1/L2 instead). It would be very beneficial for many server workloads. Of course, that is just my wishful thinking. After estimating the die size, I concluded that it would not be possible to add an L4$ of meaningful size. But who knows, if they decide to move the SC to 7nm as well in Milan....

It seems you had to work very hard to come up with something that made sense to you with 9 dies.

I guess, like everyone else, I was initially fixated on the idea that ROME would simply follow NAPLES' architecture, just extended to 8 CPU dies. That raised a number of difficult technical problems for which I could not think of possible solutions, as I explained earlier. There’s an easier answer though: I am not as smart as AMD's engineers. But that’s OK. After all, I haven't done any engineering work for >25 years.
 
Last edited:
Reactions: Vattila

PotatoWithEarsOnSide

Senior member
Feb 23, 2017
664
701
106
Vattila, excuse my ignorance, what you described sounds like 4*(4+1), which would be 20 dies, surely?

I'll forgive your brain fart.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
28 CUs with HBM2 is a niche pipe dream. You still need an affordable APU for the majority of the market.

The beauty of an on-package GPU is that you can mix and match parts. For example, an octa-core + 28 CU GPU + HBM2 would be amazing for a laptop chip, but for desktop it's overkill since you can expect most buyers will use a dGPU anyway, and thus you could include a trimmed-down GPU chip instead. You also have the flexibility of iterating the GPU and CPU parts separately depending on which is ready for market first, making it easier to create refresh parts.

I wouldn't be too surprised if next gen we saw something like a 3800X and a 3800GX, the main difference being one containing an on-package GPU.
 

CluelessOne

Member
Jun 19, 2015
76
49
91
For Ryzen, if we go with the Northbridge theory, how much area is saved by cutting off the memory controller and PCH connection? Would that be enough to add another chiplet as a Northbridge, say two chiplets of 4-core CCXs plus one for the Northbridge? Can socket AM4 accommodate it?
How big is Navi anyway? Could they integrate 4 NCUs into this Northbridge and put an AVX-512 decoder on it, so that AVX-512 calculations are done on the GPU?
 
Reactions: Olikan

inquiss

Member
Oct 13, 2010
75
136
106
Vattila, excuse my ignorance, what you described sounds like 4*(4+1), which would be 20 dies, surely?

I'll forgive your brain fart.

I'll forgive yours

What was said was 9 chips: a 16-core chip made of 4 CCXs, with each 16-core chip having its own uncore "chip" beneath it, so 2 chips per unit. To get to 64 cores we need 4 of those 16-core + uncore units, that's 8 chips. The interposer below is the ninth.
 
Reactions: Vattila

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
For Ryzen, if we go with the Northbridge theory, how much area is saved by cutting off the memory controller and PCH connection? Would that be enough to add another chiplet as a Northbridge, say two chiplets of 4-core CCXs plus one for the Northbridge? Can socket AM4 accommodate it?
How big is Navi anyway? Could they integrate 4 NCUs into this Northbridge and put an AVX-512 decoder on it, so that AVX-512 calculations are done on the GPU?
A Zeppelin die is 212.97 mm²
The CCXs on it are 44 mm² each, of which 16 mm² is L3 cache; a core is 7 mm², of which 1.5 mm² is L2 cache
The dual channel memory controller is 15 mm²

A Raven Ridge die is 209.78 mm²
Counting the pixels, the CCX (with half the L3 cache) is ~40 mm²
Vega is ~62 mm²

https://en.wikichip.org/wiki/amd/microarchitectures/zen#Die

So the core logic is roughly half of the chip size (less on Zeppelin due to all the server I/O, slightly more on the optimized APU); the rest is I/O and uncore. Note that for chiplets you still need some uncore on every chiplet to connect them all. As Zeppelin is primarily a server die, space efficiency is highest in that use case; there AMD stated that going for a monolithic chip would save ~10% of the space. With chiplets we can expect the space needed for the additional uncore to increase further.
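
Putting those figures together (rough pixel-counted estimates, not official numbers):

[CODE]
# Share of die area taken by core logic, using the estimates above
zeppelin_mm2 = 212.97
ccx_mm2 = 44.0
print(2 * ccx_mm2 / zeppelin_mm2)          # ~0.41 -> ~41% of Zeppelin is CCXs

raven_ridge_mm2 = 209.78
ccx_half_l3_mm2 = 40.0
vega_mm2 = 62.0
print((ccx_half_l3_mm2 + vega_mm2) / raven_ridge_mm2)  # ~0.49 -> ~49% of Raven Ridge is CPU+GPU
[/CODE]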
 
Reactions: Vattila

kokhua

Member
Sep 27, 2018
86
47
91
I am curious. Can you share a brief description of those, as well as your reasons for dismissing them?

OK, I re-read what you wrote carefully.

Essentially you are re-creating NAPLES architecture except that each of the 4 dies is now a standalone 16-core CPU instead of 8. This CPU is itself created from a 7nm "core" die and a 12nm "uncore" die stacked together. You do not need the 28nm passive interposer; an organic substrate will work well, just like with NAPLES. Architecturally, everything good and bad about NAPLES should apply similarly. Instead of the 2-level interconnect scheme you suggested, perhaps a 2D mesh (like in Skylake-SP) might work better.

I think the main motivation for this would be that it is significantly cheaper to make the 16C CPU from 2 dies stacked together instead of a 7nm monolithic die.

Examining the Zeppelin die layout, we see that "cores" and "uncore" each take up roughly half of the total die area of 213mm^2. Let's use 110mm^2 for each.

In the first case, you have a 7nm "cores" die and a 12nm "uncore" die. Doubling the "cores" area and dividing by a guesstimated scaling factor of 2.3, I estimate the "cores" die to be roughly 96mm^2. The "uncore" part remains at 110mm^2. Assuming a USD10K wafer price and 70% yield, the "cores" die costs roughly $21.50 to make. Assuming a USD8K wafer price and 80% yield, the "uncore" die costs roughly $17.50 to make. Ignoring the cost of stacking the 2 together, one such CPU costs roughly ~$39 to make.

In the second case, you have a 7nm monolithic die. Doubling the "cores" area, adding the "uncore", and dividing by a scaling factor of 2.3, I estimate the die size to be roughly 145mm^2. Again, assuming USD10K wafer price and 70% yield, this monolithic CPU costs ~$33 to make.

You can probably guess what I'll pick.
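
For transparency, here is the rough yield/cost arithmetic behind those numbers (wafer prices, yields and the dies-per-wafer approximation are all my guesses):

[CODE]
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Common approximation: wafer area / die area, minus edge losses
    r = wafer_diameter_mm / 2
    return math.floor(math.pi * r ** 2 / die_area_mm2
                      - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def cost_per_good_die(die_area_mm2, wafer_price_usd, yield_fraction):
    return wafer_price_usd / (dies_per_wafer(die_area_mm2) * yield_fraction)

# Case 1: stacked 7nm "cores" die + 12nm "uncore" die
print(cost_per_good_die(96, 10_000, 0.7))   # ~21.4 USD
print(cost_per_good_die(110, 8_000, 0.8))   # ~17.3 USD

# Case 2: monolithic 7nm die
print(cost_per_good_die(145, 10_000, 0.7))  # ~33.1 USD
[/CODE]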
 
Last edited:
Reactions: Vattila

Vattila

Senior member
Oct 22, 2004
805
1,394
136
OK, I re-read what you wrote carefully.

Actually, I was curious about your earlier ideas, since you said you have "toyed with many ideas over the last couple of months". I thought maybe you had considered stellar ideas, but dismissed them because of your certainty about the 9-die rumour.

But thank you very much for analysing my hypothetical stacked chiplet architecture. It was a convoluted attempt to reconcile my ideas about AMD's plans with the now dominant rumour that we will have a 9-die chiplet design.

Personally, I used to be convinced that AMD — to reduce risk and ensure reliable execution of their roadmap — would make baby-steps with Zen 2 and Zen 3, optimising and building on the current Zen architecture, in particular the 4-core CCX. I figured that the rumoured "Starship" would be the replacement of "Zeppelin" (airship, so makes sense), just adding another CCX based on Zen 2, giving 12 cores per die, and 48 cores for EPYC 2. Then with Zen 3, they would move to 4 CCXs, giving 64 cores for EPYC 3.

So my 3D stacking approach is just an attempt to reconcile these seemingly sensible ideas with current rumours. But, although such 3D stacking would fit nicely with Papermaster's comment about "improvements in multiple dimensions", I guess it is too early for such ambitious bleeding-edge packaging technology.

Regarding the 4-core CCX, interconnect topology and APU strategy, see my earlier speculation threads, if you are interested and haven't seen them.

 
Last edited:
Reactions: kokhua

kokhua

Member
Sep 27, 2018
86
47
91
Actually, I was curious about your earlier ideas, since you said you have "toyed with many ideas over the last couple of months". I thought maybe you had considered stellar ideas, but dismissed them because of your certainty about the 9-die rumour.

Sorry to disappoint. I don’t really have any original ideas. My interest is mainly to understand AMD’s competitiveness vs Intel from a product and roadmap perspective. I do this analysis to make informed investment decisions.
 
Reactions: krumme

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Sorry to disappoint. I don’t really have any original ideas. My interest is mainly to understand AMD’s competitiveness vs Intel from a product and roadmap perspective.

No disappointment here — far from it! I love your great attempt to unravel the mystery. And I love speculation. So thanks for contributing to that! I cannot wait for more information about AMD's 7nm plans and products.

I do this analysis to make informed investment decisions.

We're in the same boat, it seems.
 
Last edited:
Reactions: kokhua

kokhua

Member
Sep 27, 2018
86
47
91
Personally, I used to be convinced that AMD — to reduce risk and ensure reliable execution of their roadmap — would make baby-steps with Zen 2 and Zen 3, optimising and building on the current Zen architecture, in particular the 4-core CCX.

Similarly, when the 9-die rumor first surfaced, and considering the technical problems, I believed AMD would stick with the less risky approach: 4 dies, but upgraded to 16C each.

After expending much effort explaining how I arrived at the diagram, may I ask for your honest opinion:

Since this is quite a drastic departure from NAPLES architecture, do you think there’s any chance that I might be on to something? Do you see anything wrong with it technically?
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
may I ask for your honest opinion [about my hypothetical architecture]

Well, although I have a longstanding interest in x86 CPU and system design, originating in the late 1990s with Dirk Meyer's masterpiece, the "K7" Athlon, I am only a programmer with a very basic education in circuit design and computer architecture. My practical experience only amounts to designing a speedometer for a windmill in college (basically a counter circuit with a timer and display) and a doorbell playing tunes (based on an x86 microcontroller). So I cannot give you a professional opinion, I'm afraid.

However, my intuition is that it all comes down to topology, and that direct-connections and the number 4 have a key role in the Zen architecture. The current incarnation has fully connected units at four levels: the 4-core CCX, 2 CCXs in "Zeppelin", 4 "Zeppelins" in the socket, and 2 sockets per system motherboard. This hierarchical architecture is open to simple extension. At the die-level the 2 CCXs can be extended to up to 4 direct-connected CCXs, and at the system-level, it can be extended to 4 direct-connected sockets. That gives a potential for up to 256 cores in a system. See my thread about the CCX and topology for more discussion.
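
Spelled out, the scaling of that "quad-tree" idea looks like this (trivial arithmetic, assuming four units at every level):

[CODE]
# Core count when every level of the hierarchy groups four units:
# 4 cores per CCX, 4 CCXs per die, 4 dies per socket, 4 sockets per system
count = 1
for level in ("CCX", "die", "socket", "system"):
    count *= 4
    print(level, count)   # CCX 4, die 16, socket 64, system 256
[/CODE]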

So, if you go back to my first reply in this thread, I questioned the topology of your design, i.e. the details about how the cores are interconnected, e.g. how many hops between cores (best, worst, average). Your schematic doesn't clarify the topology, and at first glance it may seem that the System Controller chip becomes a bottleneck, as a router between all the connected cores. So I would welcome more details about the topology. And please explain what makes it better than the simple and obvious extension of the Zen "quad-tree" topology described above. That would be helpful.

Along with topology is the issue of memory coherence. Many enthusiasts on this board are concerned about core-to-core latency, since it is an issue in many workloads (e.g. games). What many may not know is that there are no instructions in the x86 ISA for communication between cores. Inter-core communication is just an unwanted side-effect of the x86 shared-memory model, in which every write to a memory location must be made visible to every cache that references it. Threads communicate through shared memory. So slowdown due to inter-core latency is often a sign of poor multi-threaded programming causing contention on shared memory and locks.

As I understand it, in systems with a large number of cores, the protocol for memory coherence needs to be very clever to avoid bottlenecking on inter-core communication. In your architecture, you allude to extensions for memory coherence ("cache coherent network" and "snoop filter"). I think supercomputers use directory schemes to minimise communication between distant cores, but I know very little about this. Any additional meat on this topic and how it affects the architecture, in particular scaling, would be great.
 
Last edited:

kokhua

Member
Sep 27, 2018
86
47
91
So I cannot give you a professional opinion, I'm afraid.

My knowledge of computer architecture is probably shallower than yours. I am not looking for a professional opinion; that would probably require insider knowledge, and I might not even comprehend it. Just a common-sense assessment given what is known and rumored. My aim is only to make a judgement on whether ROME will compete well against Intel's Cascade Lake, Cooper Lake and Ice Lake. Your response certainly goes a couple of levels deeper than I care about; my stance is that whatever architecture AMD chooses, they will know how to implement the details in an optimal way. I derive this confidence from observing the incomprehensibly complex (to me) design choices they made in Zen. To me, Zen is a work of art and delicate balance.

However, my intuition is that it all comes down to topology, and that direct-connections and the number 4 have a key role in the Zen architecture. The current incarnation has fully connected units at four levels: the 4-core CCX, 2 CCXs in "Zeppelin", 4 "Zeppelins" in the socket, and 2 sockets per system motherboard. This hierarchical architecture is open to simple extension.

The 4-core CCX and related topology are certainly elegant. But I'm not sure the number 4 holds any significance beyond the convenient direct-connectedness with a small number of links. If you extend it hierarchically, it seems like trading off link complexity against multi-level linkage; either way there will be additional latency. Xeons have long used ring and mesh topologies with good results. I depicted an 8C CCX in the diagram because I thought a larger unified L3$ shared across 8 cores would be beneficial, and I expect (hope, rather) that 8C will become the minimum in the 7nm age. I haven't given much thought as to how they should be connected; probably a mesh if I were to guess. But a case can certainly be made for sticking with dual 4C CCXs.

P.S. I'm probably guilty of a "too-quick" response again. I haven't read your thread about the CCX and topology yet, but I didn't want to keep you waiting for a response.

Along with topology is the issue of memory coherence.

I added the snoop filter almost as an afterthought, just as I added the L4 cache as wishful thinking. The logic being: with an increased number of cores, snoop traffic rises steeply. A snoop filter helps reduce that traffic. Nothing more than that. These are well-researched topics, so I assume AMD will choose the appropriate implementation.
 
Last edited:
Reactions: Vattila

kokhua

Member
Sep 27, 2018
86
47
91
And please explain what makes it better than the simple and obvious extension of the Zen "quad-tree" topology described above. That would be helpful.

I read the thread about CCX and topology. Besides the first problem I mentioned earlier about trading off link complexity against multi-level linkage, there seems to be another problem:

As you add more levels to the quad-tree hierarchy, while the number of links remains the same at 6 for each level, the link at each successive level needs to be "fatter". For example, using the rule of thumb that each core requires 2.5GB/s of bandwidth, we see that within the 4-core CCX, each of the 6 links should provide 2.5GB/s of bandwidth. At the next level, the 6 links connecting 4 clusters of 4-core CCXs would need to provide 10GB/s each, and so on and so forth. Am I correct?
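
Just to spell out the scaling I have in mind (my own rule-of-thumb numbers, ignoring any traffic locality):

[CODE]
# Per-link bandwidth needed at each level of the quad-tree if every unit
# pair is joined by one link and each core generates ~2.5 GB/s of traffic
per_core_gbs = 2.5
levels = [("core-to-core (inside a CCX)", 1),
          ("CCX-to-CCX", 4),
          ("die-to-die", 16),
          ("socket-to-socket", 64)]
for name, cores_per_unit in levels:
    print(name, cores_per_unit * per_core_gbs, "GB/s per link")
[/CODE]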
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,728
14,758
136
I’ve clarified this. What I meant to say was an artificially crippled design is not acceptable for server CPUs.
But the 2990wx is not a server chip. Your earlier comment about 4 of 8 memory channels being disabled is also incorrect. Socket TR4 only has 4 memory channels. You keep trying to act like the 2990wx is an EPYC chip.

Edit: Not only is the 2990wx not crippled, it runs the memory faster than EPYC; mine runs at 3066. This partially makes up for the 4 CCXs that do not have direct memory access. Also, you really should stay on the Rome 64 core EPYC subject. I only brought up the 2990wx when a comment was made that, for a 64 core chip, some CCXs may not have great memory access, and I was pointing out that it's only a small detriment, not as much as you may think, based on my experience with my 2990wx.
 
Last edited:

kokhua

Member
Sep 27, 2018
86
47
91
But the 2990wx is not a server chip. Your earlier comment about 4 of 8 memory channels being disabled is also incorrect. Socket TR4 only has 4 memory channels. You keep trying to act like the 2990wx is an EPYC chip.

Apparently, you still don't understand. Let me try that clarification once again, as clearly as I can:

1. In 2990WX, 2 of the 4 dies have their memory controllers disabled. As a result, cores on those dies have to go through the other 2 dies to get access to DRAM. This extra hop results in additional memory latency. Also, as I mentioned earlier, the rule of thumb is every core needs about 2.5GB/s of memory bandwidth at 60% efficiency. Using your figures, 4 channels of DDR4-3066 is sufficient for about 24 cores (rough arithmetic sketched below). As a result, performance in some applications is lower than what it could have been if all 8 memory channels were enabled. This is why I say 2990WX is artificially crippled.

2. Socket TR4 has exactly the same number of pins as SP3; i.e. 4094. There is no problem accommodating 8 memory channels if AMD decided to do so and motherboard makers designed their boards accordingly. It is purely a marketing decision, not a technical one. In fact, I would be surprised if it wasn't conceived for 8 channels (remember X499?), but only released as 4 channels with a bunch of "reserved" pins.

3. I totally understand why AMD decided to limit Threadripper to 4 memory channels and DON'T take any issue with it:

(a) Threadripper is aimed at consumers, albeit high end consumers. So, the platform cost must be kept affordable. Supporting 8 memory channels would have required an expensive motherboard with many PCB layers, and users would be required to purchase a minimum of 8 DIMMs.

(b) Threadripper comes in 2-die and 4-die variants. In the 2-die variants, the maximum possible number of memory channels is 4. To maintain compatibility with 2-die variants and avoid confusion, it is easier to limit everything to 4 memory channels.

(c) Even without the full 8ch memory, Threadripper already beats Intel's Core X series in most if not all use cases. So there is no pressing need for 8ch; might as well keep it as a trump card for the future.
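
As for the ~24-core figure in point 1, the back-of-the-envelope goes like this (assuming 8 bytes per transfer per channel and 60% efficiency):

[CODE]
# Usable memory bandwidth of 4 channels of DDR4-3066, and how many cores
# it can feed at ~2.5 GB/s per core
channels = 4
mt_per_s = 3066          # DDR4-3066: mega-transfers per second
bytes_per_transfer = 8   # 64-bit channel
efficiency = 0.6

peak_gbs = channels * mt_per_s * bytes_per_transfer / 1000
usable_gbs = peak_gbs * efficiency
print(peak_gbs)            # ~98.1 GB/s peak
print(usable_gbs)          # ~58.9 GB/s usable
print(usable_gbs / 2.5)    # ~23.5 -> roughly 24 cores
[/CODE]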

Again, what I meant to say was: artificially limiting 2990WX to 4 memory channels may be fine for its intended application and users, but a SIMILAR APPROACH is not acceptable for EPYC.

Hope that is clear enough for you.


Also, you really should stay on the Rome 64 core EPYC subject.

I always try to stay on topic. I merely replied to you when you brought up 2990WX.


One last point: if Threadripper 3 followed the architecture depicted in my diagram, the above issues no longer apply.
 
Reactions: Drazick

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,728
14,758
136
Apparently, you still don't understand. Let me try that clarification once again, as clearly as I can:

1. In 2990WX, 2 of the 4 dies have their memory controllers disabled. As a result, cores on those dies have to go through the other 2 dies to get access to DRAM. This extra hop results in additional memory latency. Also, as I mentioned earlier, the rule of thumb is every core needs about 2.5GB/s of memory bandwidth at 60% efficiency. Using your figures, 4 channels of DDR4-3066 is sufficient for about 24 cores. As a result, performance in some applications is lower than what it could have been if all 8 memory channels were enabled. This is why I say 2990WX is artificially crippled.

2. Socket TR4 has exactly the same number of pins as SP3; i.e. 4094. There is no problem accommodating 8 memory channels if AMD decided to do so and motherboard makers designed their boards accordingly. It is purely a marketing decision, not a technical one. In fact, I would be surprised if it wasn't conceived for 8 channels (remember X499?), but only released as 4 channels with a bunch of "reserved" pins.

3. I totally understand why AMD decided to limit Threadripper to 4 memory channels and DON'T take any issue with it:

(a) Threadripper is aimed at consumers, albeit high end consumers. So, the platform cost must be kept affordable. Supporting 8 memory channels would have required an expensive motherboard with many PCB layers, and users would be required to purchase a minimum of 8 DIMMs.

(b) Threadripper comes in 2-die and 4-die variants. In the 2-die variants, the maximum possible number of memory channels is 4. To maintain compatibility with 2-die variants and avoid confusion, it is easier to limit everything to 4 memory channels.

(c) Even without the full 8ch memory, Threadripper already beats Intel's Core X series in most if not all use cases. So there is no pressing need for 8ch; might as well keep it as a trump card for the future.

Again, what I meant to say was: artificially limiting 2990WX to 4 memory channels may be fine for its intended application and users, but a SIMILAR APPROACH is not acceptable for EPYC.

Hope that is clear enough for you.




I always try to stay on topic. I merely replied to you when you brought up 2990WX.


One last point: if Threadripper 3 followed the architecture depicted in my diagram, the above issues no longer apply.
That's more clear.

One more thing. You linked your Twitter account for a picture. A preferred method is a picture-hosting site like imgur.com. I use that; it's free. Once you upload an image, you then add the below, but change the {} to []:

{IMG}
https://i.imgur.com/Uy6SHhW.png
{/IMG}

You get the URL on the site by looking at your image and then copying the direct link. Below is a real example of a picture with the latency of my 2990WX (sort of on topic, as it is possibly what the 64 core EPYC will use):

 
Reactions: Drazick

Vattila

Senior member
Oct 22, 2004
805
1,394
136
As you add more levels to the quad-tree hierarchy, while the number of links remains the same at 6 for each level, the link at each successive level needs to be "fatter". […] Am I correct?

Yes, of course, every link on higher levels needs to carry more traffic from the multitude of cores at lower levels (e.g. socket to socket). But you can approach this in many ways, i.e. with different topologies. I discuss this to some degree in my topology thread. In particular, the fatter links on higher hierarchical levels need not go between just two end-points (routers) — and in fact, in the Zen design, in some cases they do not. Instead, they use a router-less approach with multiple sublinks.

For example, the fat link between two EPYC sockets comprises four sublinks, each between the corresponding dies in each socket (die 0 in socket 0 links to die 0 in socket 1, etc.). This obviates the need for a central router in each socket, which might have become a bottleneck.

On the other hand, I expect that each CCX has an Infinity Fabric controller that routes the traffic in and out of the CCX. I am not sure how the dies in the MCM package are interconnected, i.e. whether each die has a router, or there are direct sublinks between corresponding CCXs in each die (CCX 0 in die 0 links to CCX 0 in die 1, etc.).

However, in my topology thread, I speculate and calculate how many links and ports are required to use the router-less approach at every level, even for the CCXs (i.e. core 0 in CCX 0 is direct-connected to core 0 in CCX 1, etc.).

If you draw out the topology for this router-less approach, it will look like a sparsely connected hypercube, I guess. But now we are in territory where I am out of my depth. I may miscategorise these things. This is why I started the topology thread in the first place. I was hoping for an expert or two to chime in.
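
To illustrate the router-less 2-socket arrangement I described above, here is a toy model (the adjacency is just my reading of the Naples layout, not anything confirmed):

[CODE]
from collections import deque

# Toy model: 4 dies per socket, fully connected within a socket, plus one
# sublink between corresponding dies of the two sockets (die i <-> die i)
links = []
for s in range(2):
    for a in range(4):
        for b in range(a + 1, 4):
            links.append(((s, a), (s, b)))   # intra-socket full mesh
for d in range(4):
    links.append(((0, d), (1, d)))           # socket-to-socket sublinks

adj = {}
for u, v in links:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def hops(src, dst):
    # Shortest die-to-die path length (breadth-first search)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

print(len(links))                                    # 16 die-to-die links
print(max(hops(a, b) for a in adj for b in adj))     # worst case: 2 hops
[/CODE]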

Now, back to your architecture: How do you handle the choke point in your System Controller chip? Are there connections between the 8-core chiplets, or do all connect only to a central router (crossbar)? If the latter, how much bandwidth does it need to handle, and is that feasible?
 
Last edited:

Abwx

Lifer
Apr 2, 2011
11,157
3,856
136
In the first case, you have a 7nm "cores" die and a 12nm "uncore" die. Doubling the "cores" area and dividing by a guesstimated scaling factor of 2.3, I estimate the "cores" die to be roughly 96mm^2. The "uncore" part remains at 110mm^2.

And why should it remain at 110mm^2, despite the IMC being single-channel for each die, if we are to follow your speculation...?

And there's still the need for a 9th chip...?

At this point your estimation is self-contradictory...
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
An 8-core Ryzen 3000 ("Matisse") engineering sample has supposedly arrived at RTG.

https://hardforum.com/threads/the-r...-has-received-its-first-zen-2-sample.1967802/

Now the questions are: Is it an APU? Or if not, is it the same die as for the rumoured 9-die "Rome"? If so, does the die include uncore logic? Or is "Matisse" a chiplet design with a separate uncore die?

So many questions.

My wild guess: "Matisse" is an APU (like Intel's Core) for mainstream desktop and high-end/gaming laptop. That's why it is at RTG for testing. The "Rome" design is actually a 5-die design very similar to the existing design, but with four 16-core dies and a separate controller chip (with L4 cache, socket-to-socket router and directory-based cache coherence for multi-socket scaling).
 