64-core EPYC Rome (Zen2) Architecture Overview?


kokhua

Member
Sep 27, 2018
86
47
91
And why should it remain at 110mm² despite the IMC being single-channel for each die, if we are to follow your speculation?

And there's still the need for a 9th chip...?

At this point your estimation is self-contradictory...

You should read the whole thread before commenting. Here, I was discussing the "stacked die" suggested by Vattila (duplicated below for your convenience). Also, please look at my diagram again; there is no memory controller on each of the CPU dies; all the 8 memory controllers are moved to the System Controller die.

Just for fun: On a small 7nm square die, arrange 4 x 4-core CCXs in a fully-connected topology (6 links). That gives you a 16-core building block, a super-CCX with no uncore logic. Mount that super-CCX die on top of a just slightly bigger 12nm die containing all the uncore logic (IO, security processor, memory controllers, etc.). Now you have a 16-core CPU. For 64-core "Rome", mount 4 of those on a passive 28nm interposer in a fully-connected topology.

That gives you a 9-die CPU, albeit in a very implausible way.
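(For anyone counting links in these fully-connected arrangements, here is a quick back-of-envelope in Python; the 8-chiplet line at the end is only my own contrast, not anything AMD has said.)

# Links needed to fully connect n nodes: n * (n - 1) / 2
def full_mesh_links(n: int) -> int:
    return n * (n - 1) // 2

print(full_mesh_links(4))   # 6 links between the 4 CCXs inside the proposed super-CCX
print(full_mesh_links(4))   # 6 more links between the 4 super-CCX dies on the interposer
print(full_mesh_links(8))   # 28 links if you instead fully connected 8 chiplets - one reason a central IO die looks attractive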
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
You should read the whole thread before commenting. Here, I was discussing the "stacked die" suggested by Vattila (duplicated below for your convenience). Also, please look at my diagram again; there is no memory controller on each of the CPU dies; all the 8 memory controllers are moved to the System Controller die.

Well, had I been aware that you said there's no IMC at all, I would have found your estimate of the uncore remaining at 110mm² even less plausible...

Assuming there's no IMC implies there's more room to increase the core count; basically they could host 24 cores in each 210mm² die.
 

kokhua

Member
Sep 27, 2018
86
47
91
Well, had I been aware that you said there's no IMC at all, I would have found your estimate of the uncore remaining at 110mm² even less plausible...

Assuming there's no IMC implies there's more room to increase the core count; basically they could host 24 cores in each 210mm² die.

My friend, please read things in context. The 110mm² pertains to Vattila's suggested architecture, not mine.
 

yuri69

Senior member
Jul 16, 2013
437
717
136
Okay, time for a reality check.

Let's lay down a base context first:
* AMD's R&D budget has been heavily constrained over the past 10 years
* AMD started working on znver2 probably in 2013 or 2014
* Naples was the first iteration of the Zen server line - its roots go back to the 10h Magny-Cours
* Naples uses the Zeppelin die, which is reused in client Ryzen and Threadripper; the Zeppelin CCX is reused in the Raven APU
* back in summer 2017 there was no Rome but a Starship instead - 48c/96t znver2
* later in 2017 the reliable Canard PC specified "EPYC 2" as 64c, 256MB L3 (4MB per core), PCIe4
* nowadays Charlie@SA has been happy with the current Rome config, AMD is confident, etc.

So Rome seems to be a 64c chip or a bit less likely a 48c one.

Now, let's introduce the current "rumor mill" favorite plan aka chiplets. According to a youtuber, the Rome top SKU consists of 9 chips - 1 IO and 8 compute. Details are sparse, but it seems the IO chip would be manufactured at an older process than the compute ones. This idea was further detailed in the diagram posted by OP.

== Naples scaled ==
* double L3 per core - Keep the traffic levels down.
* 8 cores in a CCX - The core interconnect probably can't remain a Nehalem-ish crossbar; it would need something like Sandy Bridge's ring bus. That adds complexity (as it did for Sandy Bridge in 2011) and requires a special CCX for APUs (see the hop-count sketch after this section).
* 2 CCXs on a die - This opens up possibilities for a nice TR and scaled down Ryzens. At the same time it keeps the level of complexity down - identical CCXs. Uniform intercore latency for ubiquitous 8c is a nice bonus.
* 4 dies on a package - Simply keep the socket, NUMA mapping, etc. the same.

=> Major investments are: a new CCX for APUs, a redone intra-CCX interconnect, and cut-down Ryzens.
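(A rough hop-count sketch for the ring-bus point above; the topologies and the averaging are my own assumptions, purely illustrative.)

# Average core-to-core hop count on a bidirectional 8-stop ring,
# versus a crossbar where every pair is 1 hop. Illustrative only.
def ring_avg_hops(n: int) -> float:
    total, pairs = 0, 0
    for a in range(n):
        for b in range(a + 1, n):
            total += min(abs(a - b), n - abs(a - b))  # shorter way around the ring
            pairs += 1
    return total / pairs

print(f"8-core ring bus, average hops: {ring_avg_hops(8):.2f}")  # ~2.29
print("4-core crossbar CCX (today), average hops: 1.00")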

== The chiplets ==
* 8 cores in a CCX - The same issues as above. 8c intercore latency also the same.
* New type of "low latency" interconnect - needs low latency and very high power efficiency (all traffic past L3 goes off-chip to the IO chip, then to RAM) => R&D
* The IO ccMaster - dealing with traffic from all 64c at low latency => R&D
* L4 - R&D
* IO chip itself - can it be reused for ordinary Ryzens - 1x IO + 1x compute? Wasting server-grade IO and L4 for desktop? A different die?

=> Major investments: ???

Now, it's time to apply Occam's razor: the chiplet solution vs. an ordinary one.

Does it make sense to throw away the Magny-Naples know-how given the budget? Mind you, this was really a decision made back in ~2014 (the times when Kaveri struggled with its crippled firmware).

Does it make sense to reject znver1 and go to a super-radical design which nobody has ever tried in the x86 world, with an evolutionary arch revision (znver2)?

Are you sure you can justify the power when going in/out to NB all the time? The same for minimal latency. Can you scale the IO ccMaster, etc.?

Are the benefits worth it? UMA, yields, etc.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,863
3,417
136
Still think these uncore-die people are crazier than birthers.

1. Scaling IO with core count makes sense (more dies, more IO); having a one-size-fits-all uncore does not.

2. There are still vast improvements to cache coherency AMD can make that will have a big impact on inter-CCX latency: they can make distributed home agents, move home agents from the memory controller to the CCX, run the interfaces at faster speeds, etc.

3. R&D: where exactly is the massive increase in funding to match this? AMD is only just recovering from their crippling lack of R&D, 7nm costs are much higher than 14/16nm, and they have console/GPU work to worry about as well, having neglected GPU R&D for a good 3 years.

4. This uncore-die strategy means a two-die-a-year plan becomes a minimum of four: APU, performance desktop, uncore die, server die.

5. Executing data is cheap, moving data is expensive. The original exascale white papers were about making movement of data cheaper: GPU chiplets, CPU chiplets, stacked memory, active interposer, big reductions in the power to transmit. This uncore die does none of that. Most of all, it's HPC-targeted, not general compute.

6. Big fat L4s like people are talking about aren't free; you either waste more power or add latency to memory access (check the L4 at the same time as checking memory, or wait for the L4 result and then check). If big caches only cost silicon and were mega awesome for performance, you would see far more than the "failed" Crystalwell.

7. This plan hurts worst-case performance a lot for almost no gain; it also increases power per memory request.

8. AMD aren't going to be in any way die-size constrained, nor is 200 mm² going to be a problem for yields.

200 mm² per die, 4x 4-core CCXs per die, 16MB per L3, 32 lanes of 25/50G SERDES per die, improved coherency mechanisms, simple drop-in compatibility with all the existing sockets.
 

yuri69

Senior member
Jul 16, 2013
437
717
136
Yeah, this sounds like the cheapest evolution. Put 4x 4-core CCXs per die along with improvements in cache coherency. This would probably yield "good enough" scaling without any CCX-related risk.

Redoing the CCXs to a full 8c would be more ambitious/risky. Dunno about its risk/benefit ratio.
 
Reactions: Vattila

kokhua

Member
Sep 27, 2018
86
47
91
Okay, time for a reality check.

Let's lay down a base context first:
* AMD's R&D budget has been heavily constrained over the past 10 years
* AMD started working on znver2 probably in 2013 or 2014
* Naples was the first iteration of the Zen server line - its roots go back to the 10h Magny-Cours
* Naples uses the Zeppelin die, which is reused in client Ryzen and Threadripper; the Zeppelin CCX is reused in the Raven APU
* back in summer 2017 there was no Rome but a Starship instead - 48c/96t znver2
* later in 2017 the reliable Canard PC specified "EPYC 2" as 64c, 256MB L3 (4MB per core), PCIe4
* nowadays Charlie@SA has been happy with the current Rome config, AMD is confident, etc.

So Rome seems to be a 64c chip or a bit less likely a 48c one.

Now, let's introduce the current "rumor mill" favorite plan aka chiplets. According to a youtuber, the Rome top SKU consists of 9 chips - 1 IO and 8 compute. Details are sparse, but it seems the IO chip would be manufactured at an older process than the compute ones. This idea was further detailed in the diagram posted by OP.

== Naples scaled ==
* double L3 per core - Keep the traffic levels down.
* 8 cores in a CCX - The core interconnect probably can't remain a Nehalem-ish crossbar; it would need something like Sandy Bridge's ring bus. That adds complexity (as it did for Sandy Bridge in 2011) and requires a special CCX for APUs.
* 2 CCXs on a die - This opens up possibilities for a nice TR and scaled down Ryzens. At the same time it keeps the level of complexity down - identical CCXs. Uniform intercore latency for ubiquitous 8c is a nice bonus.
* 4 dies on a package - Simply keep the socket, NUMA mapping, etc. the same.

=> Major investments are: a new CCX for APUs, a redone intra-CCX interconnect, and cut-down Ryzens.

== The chiplets ==
* 8 cores in a CCX - The same issues as above. 8c intercore latency also the same.
* New type of "low latency" interconnect - needs low latency and very high power efficiency (all traffic past L3 goes off-chip to the IO chip, then to RAM) => R&D
* The IO ccMaster - dealing with traffic from all 64c at low latency => R&D
* L4 - R&D
* IO chip itself - can it be reused for ordinary Ryzens - 1x IO + 1x compute? Wasting server-grade IO and L4 for desktop? A different die?

=> Major investments: ???

Now, it's time to apply Occam's razor: the chiplet solution vs. an ordinary one.

Does it make sense to throw away the Magny-Naples know-how given the budget? Mind you, this was really a decision made back in ~2014 (the times when Kaveri struggled with its crippled firmware).

Does it make sense to reject znver1 and go to a super-radical design which nobody has ever tried in the x86 world, with an evolutionary arch revision (znver2)?

Are you sure you can justify the power when going in/out to NB all the time? The same for minimal latency. Can you scale the IO ccMaster, etc.?

Are the benefits worth it? UMA, yields, etc.

Let me make sure I understand you correctly before I comment. You are saying that because of R&D budget constraints and according to your estimated timeline, it makes more sense that AMD will pursue an evolutionary approach for ROME by scaling the NAPLES architecture as follows:

1. Stick to the current 4-die architecture, each die a standalone CPU reusable for Ryzen.
2. Move from a 4-core CCX to an 8-core CCX, changing the intra-CCX core interconnect to a ring or mesh topology.
3. Double the per-core L3 cache from 2MB to 4MB.

Your objections to the diagram I drew are:

1. R&D needed for a new type of low-latency interconnect for the CPU-to-System Controller link
2. R&D needed for the "ccMaster" to deal with traffic from 64C at low latency
3. R&D needed for the L4 cache
4. The System Controller chip cannot be reused for Ryzen efficiently.

Am I correct? Did I leave out anything?
 

jpiniero

Lifer
Oct 1, 2010
14,835
5,452
136
8. AMD aren't going to be in any way die-size constrained, nor is 200 mm² going to be a problem for yields.

I thought 4x4 was the way AMD was going to go too, but 7 nm yields are likely not great and the wafers very expensive, not to mention the design costs.... It may have been cheaper to do the 12 nm Uncore die method.

Plus there's the factor of having to continue to deal with the 12 nm WSA (even if the 7 nm is no more).
 
Reactions: Vattila

itsmydamnation

Platinum Member
Feb 6, 2011
2,863
3,417
136
I thought 4x4 was the way AMD was going to go too, but 7 nm yields are likely not great and the wafers very expensive, not to mention the design costs.... It may have been cheaper to do the 12 nm Uncore die method.

Plus there's the factor of having to continue to deal with the 12 nm WSA (even if the 7 nm is no more).
You have completely contradicted yourself. The uncore way is significantly more complex, requires significantly more R&D, and only removes components that are at low risk of defects (physical interfaces) or things where AMD only needs to get one die with them working (USB, SATA, etc.).

Filling the wafer supply agreement will be easy given they have the China SoC and plenty of existing 12/16nm products, APUs and GPUs (that's assuming it isn't already null and void thanks to the 7nm failure).

If Apple can put a ~100 mm² die - their first 7nm chip - in iPhones right off the bat, today, with almost no binning and no partial dies, then AMD can put 200 mm² in a chip that launches in 6 months' time on a less dense process and can be both binned and salvaged.
 
Last edited:

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
My friend, please read things in context. The 110mm² pertains to Vattila's suggested architecture, not mine.

The 110mm² I was talking about is this:

The "uncore" part remains at 110mm^2.

And my answer was :

Well, had I been aware that you said there's no IMC at all, I would have found your estimate of the uncore remaining at 110mm² even less plausible...

Assuming there's no IMC implies there's more room to increase the core count; basically they could host 24 cores in each 210mm² die.
 

kokhua

Member
Sep 27, 2018
86
47
91
Still think these uncore-die people are crazier than birthers.

I suggest you reserve judgement and avoid calling others names. The truth will be out in a few months, there's no hurry. With this diagram, I stuck my neck out, in the hope of triggering deeper discussions. There are a million ways to do the same thing. In all likelihood I will be completely wrong and become a laughing stock. But I'm OK with that.

1. Scaling IO with core count makes sense (more dies, more IO)....

How exactly? Move to 16 memory channels and 256 PCIe lanes? What about socket compatibility? It would be nice if you can draw a diagram to illustrate what you mean, like I did.

2. There are still vast improvements to cache coherency AMD can make that will have a big impact on inter-CCX latency: they can make distributed home agents, move home agents from the memory controller to the CCX, run the interfaces at faster speeds, etc.

How does cache coherency affect inter-CCX latency? What do you mean by "(distributed) home agents"? And doesn't running interfaces at fast speeds burn more power too?

3. R&D: where exactly is the massive increase in funding to match this? AMD is only just recovering from their crippling lack of R&D, 7nm costs are much higher than 14/16nm, and they have console/GPU work to worry about as well, having neglected GPU R&D for a good 3 years.

4. This uncore-die strategy means a two-die-a-year plan becomes a minimum of four: APU, performance desktop, uncore die, server die.

Read my earlier comment (post #118, pg. 5). AMD only needs to design 3 unique dies for a complete CPU product stack that will deliver a knock-out to Intel from top to bottom in 2019. Is that so prohibitively expensive?

5. Executing data is cheap, moving data is expensive. The original exascale white papers were about making movement of data cheaper: GPU chiplets, CPU chiplets, stacked memory, active interposer, big reductions in the power to transmit. This uncore die does none of that. Most of all, it's HPC-targeted, not general compute.

Read my explanation of how I arrived at the diagram again (post #55, pg. 3). I did not rule out the use of silicon interposers; only that it might not be necessary in this case. Interposers (especially active ones), stacked memory, etc. are expensive.

6. Big fat L4s like people are talking about aren't free; you either waste more power or add latency to memory access (check the L4 at the same time as checking memory, or wait for the L4 result and then check). If big caches only cost silicon and were mega awesome for performance, you would see far more than the "failed" Crystalwell.

You must have missed it; I repeated several times that the L4 cache in my diagram is just wishful thinking. I don't agree that L4 caches are useless, especially for server workloads. Your opinion is as good as mine though, so I'll leave it there.

7. This plan hurts worst-case performance a lot

How does it hurt worst case performance? Care to elaborate?

........for almost no gain.

I listed a number of advantages of this architecture. Please see post #129, pg. 6

.....it also increases power per memory request.

I acknowledge this point. But it may not be as bad as you think. Remember, this architecture removes the need for the 6 criss-crossing IF links used to interconnect the 4 dies in the NAPLES architecture. Those links use (relatively) high-power SERDES to drive (relatively) long traces. For ROME, the links are also expected to run at 25 GT/s vs 10.7 GT/s in NAPLES, so there's some more power saved there.
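(To show how I'd frame the trade-off, a toy link-count/energy comparison; the pJ/bit figure and per-link traffic are placeholders I invented, not AMD numbers.)

# Fully-connected 4-die MCM (NAPLES-style) vs a star of 8 chiplets around an IO die.
# ALL figures below are assumptions for illustration only.
PJ_PER_BIT = 2.0        # assumed energy per bit over an organic-package SERDES link
GBIT_PER_LINK = 100.0   # assumed sustained traffic per link, Gbit/s

def link_power_w(links: int) -> float:
    return links * GBIT_PER_LINK * 1e9 * PJ_PER_BIT * 1e-12

naples_links = 4 * 3 // 2   # 6 criss-crossing die-to-die links
star_links = 8              # 8 chiplet-to-IO-die links, no chiplet-to-chiplet links

print(f"NAPLES-style mesh: {naples_links} links, ~{link_power_w(naples_links):.1f} W at the assumed load")
print(f"Star around IO die: {star_links} links, ~{link_power_w(star_links):.1f} W at the assumed load")
# The real question is how much traffic each link actually carries: in the star
# layout every DRAM access crosses a link, so per-request energy can still rise
# even when the raw link count and speed look comparable.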

8. AMD aren't going to be in any way die-size constrained, nor is 200 mm² going to be a problem for yields.

200 mm² per die, 4x 4-core CCXs per die, 16MB per L3, 32 lanes of 25/50G SERDES per die, improved coherency mechanisms, simple drop-in compatibility with all the existing sockets.

I don't disagree with this point. Note that this architecture is also 100% drop-in compatible with existing sockets. Additionally, it adds a new possibility: 4-socket support (NAPLES is limited to 2P).

Finally, just to repeat once again: my diagram is premised on the 9-die rumor being true. I've said that I believe it is true because multiple reliable sources say the same thing, and more importantly because I can see many advantages that this architecture provides over a "scaled" NAPLES. It is my judgement and opinion that AMD will be bold enough to make radical changes to the architecture in order to leapfrog Intel. Remember, ROME was designed to compete with Sky Lake, not Cascade Lake or the hack-job Cooper Lake. Prior to the 64C/9-die rumor, I also thought ROME would be a small evolutionary step from NAPLES.
 
Last edited:

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
1. Scaling IO with core count makes sense (more dies, more IO); having a one-size-fits-all uncore does not.

I disagree completely. You're limited by the socket, so on the one end you reduce the amount of redundant IO, or going the other way you increase flexibility because it allows you to differentiate things in a decoupled manner. For example, there are eight-core Epyc SKUs that still comprise 4 dies because certain applications don't need high core counts, where from a cost and power perspective one uncore die combined with one CPU die would be far superior.

2. There are still vast improvements to cache coherency AMD can make that will have a big impact on inter-CCX latency: they can make distributed home agents, move home agents from the memory controller to the CCX, run the interfaces at faster speeds, etc.

Agreed.

3. R&D: where exactly is the massive increase in funding to match this? AMD is only just recovering from their crippling lack of R&D, 7nm costs are much higher than 14/16nm, and they have console/GPU work to worry about as well, having neglected GPU R&D for a good 3 years.

Also a good point.

4. This uncore-die strategy means a two-die-a-year plan becomes a minimum of four: APU, performance desktop, uncore die, server die.

Not a good point. Besides missing the Chinese APU, AMD, Nvidia, and Intel all seem to have no issues proliferating dies when it makes sense. From a cost perspective, lowering the die sizes on the 7nm CPU dies makes a whole lot of sense.

5. Executing data is cheap, moving data is expensive. The original exascale white papers were about making movement of data cheaper: GPU chiplets, CPU chiplets, stacked memory, active interposer, big reductions in the power to transmit. This uncore die does none of that. Most of all, it's HPC-targeted, not general compute.

While in principle you have a point, I'm not sure if your assumptions make sense. Having all the memory controllers and one big L4 in a single place, connected to the other dies via interposer, surely has benefits over the current distributive approach when it comes to power requirements.

6. Big fat L4s like people are talking about aren't free; you either waste more power or add latency to memory access (check the L4 at the same time as checking memory, or wait for the L4 result and then check). If big caches only cost silicon and were mega awesome for performance, you would see far more than the "failed" Crystalwell.

You add latency to memory access of course, but pulling data from the L4 is obviously going to be cheaper and faster than main memory. If the hit rate is high enough, then you come out with a win. And it's difficult to imagine that wouldn't be the case.
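(A back-of-envelope on that hit-rate argument; the latencies below are made-up round numbers, purely to show the shape of the trade-off.)

# Effective latency with an L4 checked serially before DRAM.
# Both latencies are invented round numbers for illustration.
L4_NS = 40.0     # assumed L4 lookup latency
DRAM_NS = 90.0   # assumed DRAM latency once the L4 miss is known

def effective_ns(hit_rate: float) -> float:
    return L4_NS + (1.0 - hit_rate) * DRAM_NS  # always pay the L4 lookup, add DRAM on a miss

for hr in (0.0, 0.3, 0.5, 0.7):
    print(f"L4 hit rate {hr:.0%}: ~{effective_ns(hr):.0f} ns vs {DRAM_NS:.0f} ns with no L4")
# With these numbers the L4 breaks even somewhere around a ~45% hit rate;
# below that it only adds latency, which is itsmydamnation's worst-case point.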

Crystalwell by the way was awesome, and provided a decent IPC boost for some loads, especially gaming. But why, from Intel's perspective, would you spend that many transistors to improve an area where you already beat AMD? And a big unified off chip cache makes 8x as much sense when you have 8 big pools of cache that are distinct and somewhat redundant, as is the case with current gen Epyc. AMD is using a lot of transistors on that 64MB of L3 in current gen Epyc, which performs more like 8MB, and is sure getting a lot less out of it than Intel's 38.5MB L3 in their 28-core parts.

If the 9 chip rumour is false then the least I expect is that AMD will unify the L3 cache so that it's shared between all CCXs on a chip, which by itself would halve the amount of redundancy. Going to 4 CCXs without unifying the L3 would be madness and double the number of redundant transistors from current Epyc. 256MB of L3 cache that behaves like 16MB would be such a huge fail.

7. This plan hurts worst-case performance a lot for almost no gain; it also increases power per memory request.

Unlikely that either of those assertions are true.

8. AMD aren't going to be in any way die-size constrained, nor is 200 mm² going to be a problem for yields.

200 mm² per die, 4x 4-core CCXs per die, 16MB per L3, 32 lanes of 25/50G SERDES per die, improved coherency mechanisms, simple drop-in compatibility with all the existing sockets.

Yeah, I don't see anything stopping them from using a traditional approach either. I think you're a bit out to lunch when it comes to thinking that the 9 chip approach is somehow worse, but everything else being equal a refined version of the existing approach makes sense.

That said, I think your one good argument, and it's a really good one, against the 9 chip solution is that it seems way too aggressive from an R&D perspective and the timeline would place that decision at a point where pursuing it feels pretty close to betting the farm.

On the other hand, the best argument for the 9 chip solution is the sheer amount of corroborating evidence and chatter from reliable sources concerning it. I personally hope that the 9 chip rumour is genuine, but if I had to place money on one or the other right now, I think I'd just go ahead and flip a coin.
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,863
3,417
136
How exactly? Move to 16 memory channels and 256 PCIe lanes? What about socket compatibility? It would be nice if you can draw a diagram to illustrate what you mean, like I did.
You just can't add more cores to a socket willy-nilly anyway; you need both the pins to deliver power to it and the power delivery already in the motherboard design.
I don't need to draw a diagram. If AMD wanted to increase packaging complexity and deploy a new socket, they could package more dies per chip using the existing ratios. New DDR and PCIe standards will keep providing more bandwidth per pin, allowing each die to continue increasing core counts.

How does cache coherency affect inter-CCX latency? What do you mean by "(distributed) home agents"? And doesn't running interfaces at fast speeds burn more power too?
Currently the home agent for tracking memory requests is the memory controller the data resides on. This means any data that needs to transfer between CCXs has to be looked up at the memory controller that data ultimately resides on; if CCX A is on die A, CCX B is on die B, and the memory resides on die C, you can see the problem. Distributed home agents generally means that the home agent closest to the data is responsible for tracking that data.

This was the big change Intel made when moving to Skylake-SP, and you can look at the complex memory-workload benchmark data to see the result. Except Intel have a home agent per core, which is actually a massive amount of logic; AMD could do one per CCX.
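(A minimal toy of the difference being described; the interleaving scheme and agent counts are made up and don't reflect AMD's or Intel's actual coherence implementation.)

# Toy model: which agent do you have to ask about a cache line?
NUM_DIES = 4   # NAPLES-style package
NUM_CCXS = 8   # pretend one home agent per CCX in the distributed variant

def centralized_home(addr: int) -> str:
    # Home agent = the memory controller owning the address, so a request from
    # die A about data homed on die C always crosses the fabric to die C first.
    return f"memory controller on die {(addr >> 6) % NUM_DIES}"

def distributed_home(addr: int, requester_ccx: int) -> str:
    # One home agent per CCX: requests for lines homed locally never leave the CCX.
    owner = (addr >> 6) % NUM_CCXS
    return "local home agent" if owner == requester_ccx else f"home agent at CCX {owner}"

addr = 0x12345680
print(centralized_home(addr))                    # e.g. 'memory controller on die 2'
print(distributed_home(addr, requester_ccx=1))   # e.g. 'home agent at CCX 2'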

In terms of interface speed versus power cost, you would have to get the data from the Synopsys DesignWare PHYs and compare the 12.5G with the 25/50G.

Read my earlier comment (post #118, pg. 5). AMD only needs to design 3 unique dies for a complete CPU product stack that will deliver a knock-out to Intel from top to bottom in 2019. Is that so prohibitively expensive?
It all comes down to cost. Remember, you have to build an entire SKU stack; what do your costs look like on a salvaged 4- or 6-core part?


Read my explanation of how I arrived at the diagram again (post #55, pg. 3). I did not rule out the use of silicon interposers; only that it might not be necessary in this case. Interposers, stacked memory, active interposers, etc. are expensive.
I've read it; it doesn't change my point. You're making things cost more power and more latency - all memory requests get an extra hop in each direction. If you can make it go faster (architectural design around memory access), you can do the same thing without the uncore die and be even faster again.

You must have missed it; I repeated several times that the L4 cache in my diagram is just wishful thinking. I don't agree that L4 caches are useless, especially for server workloads. Your opinion is as good as mine though, so I'll leave it there.
It's not wishful, it's just generally not worth it; otherwise we would be seeing dirt-cheap 45/32nm SOI eDRAM caches packaged with lots of CPUs, but we don't.


How does it hurt worst case performance? Care to elaborate?
Worst-case performance is always a cache miss/dirty line and a re-read from memory; you just increased that access latency.


I listed a number of advantages of this architecture. Please see post #129, pg. 6
Yeah, I don't see it. AMD think they can address 80% of the market with Naples; they will be able to address even more with a 64-core, 256MB-L3, improved-inter-CCX-latency SKU. How much more of the market does your design open up, and at what cost?


I acknowledge this point. But it may not be as bad as you think. Remember, this architecture removes the need for the 6 criss-crossing IF links used to interconnect the 4 dies in the NAPLES architecture. Those links use (relatively) high-power SERDES to drive (relatively) long traces. For ROME, the links are also expected to run at 25 GT/s vs 10.7 GT/s in NAPLES, so there's some more power saved there.
Yes, but now you have made every request require its use. There are many common scenarios where the GMI interfaces will see little use; how does that play out on a NUMA-aware hypervisor where 90%+ of VMs are 4 threads and below and almost all are 8 threads and below? Or a NUMA-aware renderer, etc.?


. Remember, ROME was designed to compete with Sky Lake, not Cascade Lake or the hack-job Cooper Lake. Prior to the 64C/9-die rumor, I also thought ROME would be a small evolutionary step from NAPLES.
This is wrong; AMD themselves have said otherwise. It was designed to compete with a projected Intel 10nm server SKU.
 

kokhua

Member
Sep 27, 2018
86
47
91
This is wrong; AMD themselves have said otherwise. It was designed to compete with a projected Intel 10nm server SKU.

My mistake. I meant to say Ice Lake, not Sky Lake. Will respond to other points when I have time; busy at the hospital.....
 

DrMrLordX

Lifer
Apr 27, 2000
21,804
11,157
136
An 8-core Ryzen 3000 ("Matisse") engineering sample has supposedly arrived at RTG.

Now the questions are: Is it an APU?

Not sure if I'm repeating myself or not (skimming some of the speculative posts here), but if you read the thread you pasted, _mockingbird has said that it is at RTG for interconnect testing. Specifically:
Apparently, there have been some changes to the "interconnect" (wherever that is) that require RTG to make changes to the video drivers, and that's why RTG is getting the sample.

So it is likely that Matisse has the same underlying layout as Pinnacle Ridge/Summit Ridge.
 

yuri69

Senior member
Jul 16, 2013
437
717
136
...
Am I correct? Did I leave out anything?
Nope, you didn't leave out anything important.

Evolutionary, not revolutionary. Cheap, not expensive. Common sense, not over-engineered.

Btw this chiplet idea comes from a single source - the youtuber... That makes it really super-credible.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
The 8-chiplets-plus-1 approach works OK for a high-end, high-margin server SKU, but it's very expensive for an 8C/16T entry-level server part.

So either they will create two different dies or they will continue with a single die to rule them all like EPYC 1.
 

kokhua

Member
Sep 27, 2018
86
47
91
Nope, you didn't leave out anything important.

Evolutionary, not revolutionary. Cheap, not expensive. Common sense, not over-engineered.

Btw this chiplet idea comes from a single source - the youtuber... That makes it really super-credible.

Don't be too quick to judge. AdoredTV is not the first or the only one who said this. Besides, as I said, I no longer doubt that ROME will be 9 dies, not just because the rumor sources are credible (in my opinion), but also because it makes sense.

I drew this diagram. It is purely speculation on my part, nothing to do with the rumor sources. I won't claim it is entirely original though. Everything, except the memory compression and L4 cache, is deduced logically from the rumors. It is neither expensive nor over-engineered. Nor is it nonsensical. I will address your other points once I am free. Give me a bit of time.

Clearly, people have difficulty imagining AMD being bold enough to do something revolutionary. They did that with Zen, didn't they?
 
Reactions: amd6502

yuri69

Senior member
Jul 16, 2013
437
717
136
Don't be too quick to judge. AdoredTV is not the first or the only one who said this. Besides, as I said, I no longer doubt that ROME will be 9 dies, not just because the rumor sources are credible (in my opinion), but also because it makes sense.
Please, could you name the other sources?

It is neither expensive nor over-engineered. Nor is it nonsensical. I will address your other points once I am free. Give me a bit of time.
Looking forward to this.

Clearly, people have difficulty imagining AMD being bold enough to do something revolutionary. They did that with Zen, didn't they?
Sorry to crush your image of Zen, but in fact, the 17h is not revolutionary in the x86 landscape. It's more an evolution of Intel Core processor architecture with AMD's traditional features (including MCM). OTOH, 15h/Bulldozer was clearly a revolutionary x86 design...

When you focus purely on AMD, then sure the 17h is a revolution coming from 15h. But still, this completely new AMD family/architecture just expanded Magny's MCM.

I can't see AMD pulling off a very dubious and radical architecture departure during a current architecture refresh.

kokhua said:
...CPU product stack that will deliver a knock-out to Intel from top to bottom in 2019.
Intel will not be a sitting duck. They seem to plan stitching two 28c dies (???) together in an MCM manner, aka Cascade Lake-AP, in a monstrous 5903-pin socket. Sure, the TDP will go through the roof, but the benchmarks will probably look OK as usual.
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
Sorry to crush your image of Zen, but in fact, the 17h is not revolutionary in the x86 landscape. It's more an evolution of Intel Core processor architecture with AMD's traditional features (including MCM). OTOH, 15h/Bulldozer was clearly a revolutionary x86 design...

.

These are completely different uarches; actually, Zen is an evolution of BD with the two cores being fused and the FPU kept unchanged.
 
Reactions: amd6502

kokhua

Member
Sep 27, 2018
86
47
91
Please, could you name the other sources?.

Sorry, I am not at liberty to say. But I suspect you know at least one other source.

Looking forward to this.

I promise, but give me some time. Have to attend to some family medical issues first. This and your other points deserve a proper response. But a quick one on the cost: I have done some very crude estimates. Assuming that we stick with organic MCM, my hypothetical 8x8C+1 ROME will cost roughly the same as 4x16C NAPLES (scaled to 7nm); around $200.
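(Here is the kind of crude estimate I mean - a simple Poisson defect-density yield model; the die sizes, wafer prices and defect densities are my own guesses, not foundry data.)

import math

WAFER_MM2 = math.pi * (300 / 2) ** 2   # 300 mm wafer, ignoring edge loss and scribe lines

def die_cost(area_mm2: float, wafer_price: float, d0_per_mm2: float) -> float:
    dies = WAFER_MM2 / area_mm2
    yld = math.exp(-d0_per_mm2 * area_mm2)   # Poisson yield model
    return wafer_price / (dies * yld)

# Hypothetical 8x ~75 mm2 7nm chiplets + 1x ~400 mm2 14/12nm IO die
rome_guess = 8 * die_cost(75, 12000, 0.003) + die_cost(400, 6000, 0.001)

# Hypothetical 4x ~160 mm2 7nm dies (a "scaled NAPLES")
scaled_naples_guess = 4 * die_cost(160, 12000, 0.003)

print(f"9-die ROME guess:   ~${rome_guess:.0f}")
print(f"4-die scaled guess: ~${scaled_naples_guess:.0f}")
# Both land in the same rough ballpark, which is all this estimate is meant to show.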

Sorry to crush your image of Zen, but in fact, the 17h is not revolutionary in the x86 landscape. It's more an evolution of Intel Core processor architecture with AMD's traditional features (including MCM). OTOH, 15h/Bulldozer was clearly a revolutionary x86 design...

When you focus purely on AMD, then sure the 17h is a revolution coming from 15h. But still, this completely new AMD family/architecture just expanded Magny's MCM.

It's OK, we can have our opinions. To me, Zen seems quite revolutionary, technically and commercially. Enough to get BK retired (verb). Just curious, how do you keep those 17h, 15h numbers in your head?? I can't even remember code names; drowned in the lakes long ago ;-)

I can't see AMD pulling off a very dubious and radical architecture departure during a current architecture refresh.

Radical? Yes; it took me 2 months to accept my own conclusion. Dubious? Time will tell.

I don't know about you, but I didn't foresee AMD coming up with something like EPYC, a modular MCM with 4 Zeppelin dies, each a standalone processor, reusable for Ryzen. Or a 32C Threadripper. Or that Zen would actually deliver a huge increase in IPC. I suspect Intel did not either. Hell, many thought AMD would go bankrupt not so long ago!

Intel will not be a sitting duck. They seem to plan stitching two 28c(???) in a MCM manner aka Cascade Lake-AP in a monstrous 5903pin socket. Sure, the TDP will go through the roof but benchmarks will probably look OK as usually.

Of course not. But seriously, you don't think stitching 2 huge dies together with EMIB, and a 350W TDP is going to stave off AMD, do you?
 
Last edited:

yuri69

Senior member
Jul 16, 2013
437
717
136
I don't know about you, but I didn't foresee AMD coming up with something like EPYC, a modular MCM with 4 Zeppelin dies, each a standalone processor, reusable for Ryzen. Or a 32C Threadripper. Or that Zen would actually deliver a huge increase in IPC. I suspect Intel did not either. Hell, many thought AMD would go bankrupt not so long ago!
Frankly, when the first "new thing" rumors surfaced, I thought AMD would slap 16c of more or less enhanced 16h/Jaguar cores together. At that time the cat cores did OKish and 'micro-servers', web serving and ARM were the main buzzwords.

When it became apparent they intended to replace their high-perf core aka Bulldozer, it was nearly a given they would go the 2-way MCM route, as they did with the Interlagos, Magny and many previous ESes.

The IPC had been predicted as ~Nehalem. However, this estimate slowly improved as descriptions of the tight latencies and the balanced FPU and INT pipelines appeared.

Of course not. But seriously, you don't think stitching 2 huge dies together with EMIB, and a 350W TDP is going to stave off AMD, do you?
Well, given Intel's marketing/relationship/lineage/powers, it wouldn't be that surprising. Objectivity can go aside. There are people who remember eg. Intel Dunnington era.
 

beginner99

Diamond Member
Jun 2, 2009
5,223
1,598
136
Ryzen Desktop and Notebooks: Different 8C/16T APU, i.e. bring back integrated GPU to mainstream desktop CPUs. A competitive CPU paired with a superior GPU would be a win for AMD. Fuse off features for product segmentation.

While a separate consumer die is possible, I highly doubt it contains an iGPU, simply due to the die size needed (cost and yield on 7nm) and the lack of need for it. People who buy 8-core parts also buy dGPUs. Intel's 6-core and upcoming 8-core parts have an iGPU because those will also be used in laptops. AMD is focusing on the mainstream laptop market, where a 4-core + gaming-capable iGPU works best.
 

kokhua

Member
Sep 27, 2018
86
47
91
Yes, of course, every link on higher levels needs to carry more traffic from the multitude of cores at lower levels (e.g. socket to socket). But you can approach this in many ways, i.e. with different topologies. I discuss this to some degree in my topology thread. In particular, the fatter links on higher hierarchical levels need not go between just two end-points (routers) — and in fact, in the Zen design, in some cases they do not. Instead, they use a router-less approach with multiple sublinks.

For example, the fat link between two EPYC sockets comprises four sublinks, each between the corresponding dies in each socket (die 0 in socket 0 links to die 0 in socket 1, etc.). This obviates the need for a central router in each socket, which might have become a bottleneck.

On the other hand, I expect that each CCX has an Infinity Fabric controller that routes the traffic in and out of the CCX. I am not sure how the dies in the MCM package are interconnected, i.e. whether each die has a router, or there are direct sublinks between corresponding CCXs in each die (CCX 0 in die 0 links to CCX 0 in die 1, etc.).

However, in my topology thread, I speculate and calculate how many links and ports are required to use the router-less approach at every level, even for the CCXs (i.e. core 0 in CCX 0 is direct-connected to core 0 in CCX 1, etc.).

If you draw out the topology for this router-less approach, it will look like a sparsely connected hypercube, I guess. But now we are in territory where I am out of my depth. I may miscategorise these things. This is why I started the topology thread in the first place. I was hoping for an expert or two to chime in.

Vattila, I'm sorry. I think this subject is a bit too academic for me. There's not much I can contribute that will further our understanding. For my purpose, I am perfectly happy to treat it as a black box and let the experts worry about the details. I'd like to stop here if you don't mind.

Now, back to your architecture: How do you handle the choke point in your System Controller chip? Are there connections between the 8-core chiplets, or do all connect only to a central router (crossbar)? If the latter, how much bandwidth does it need to handle, and is that feasible?

Again, I'm out of my depth here. But this is not something new. Ampere's just-released ARM server processor sports a very similar architecture, except it's monolithic (https://amperecomputing.com/wp-content/uploads/2018/02/ampere-product-brief.pdf). AMD certainly does not lack expertise in this area.
 
Reactions: Vattila