64-core EPYC Rome (Zen2) Architecture Overview?


mattiasnyc

Senior member
Mar 30, 2017
356
337
136
Very confused by this post.

At present, we have AM4 with 2 channels for an 8-core Zen die. Do you see an 8-core Zen2 CPU with increased IPC and frequency being unaffected by dropping to 1 memory channel? This is to be a general-purpose CPU, not just for compute-bound programs.

Sorry, I actually didn't mean 1 channel per die. You're absolutely right and I read the post you were responding to too quickly, so I thought about it backwards... I basically "saw" 1 channel / ccx and figured that's what we have, so why would we see a problem... But now that I read it again I see it indeed said 1 channel per die, which.... why? To keep AM4 compatibility with a higher core count? Was that the point? If so I agree that it seems pretty meaningless.

Just ignore what I wrote.

My bet is on a better / faster architecture coming up for AM4, and possibly Threadripper. Epyc could possibly use some more cores.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
Think 2990wx. I have 4 channels supporting 32 cores. Does pretty well. Clocked at 3700. This will be the same density as 64 cores on 8 channels. And essentially same socket.

I think I wrote my reply without a clear understanding of what Maddie replied to, which was sloppy.

Anyway, on "our" end of HEDT usage (content creation) the design of the 2990wx is very hit-or-miss, and I really mean that. With some workloads it does absolutely great, specifically ones that are heavy on compute and light on memory usage. But for timing-critical workloads like audio it seems to run into a wall the way it stands today. The only test I've seen on it essentially shows the CPU's cores woefully underutilized until the load gets high enough that the entire system hangs. The most reasonable culprit here is the need to jump through dies to access memory while that access is timing-critical (audio).

So I get that conceptually it might look appealing to double the core count on Ryzen the way it was done on the 2990wx, but I really do wonder if that's going to be all that beneficial if it suffers the same potential drawbacks. I'm assuming btw that that's what the proposition was.

Also, there is of course the chance (?) that future changes alleviate this potential bottleneck.
 

lightmanek

Senior member
Feb 19, 2017
399
798
136
I think I wrote my reply without a clear understanding of what Maddie replied to, which was sloppy.

Anyway, on "our" end of HEDT usage (content creation) the design of the 2990wx is very hit-or-miss, and I really mean that. With some workloads it does absolutely great, specifically ones that are heavy on compute and light on memory usage. But for timing-critical workloads like audio it seems to run into a wall the way it stands today. The only test I've seen on it essentially shows the CPU's cores woefully underutilized until the load gets high enough that the entire system hangs. The most reasonable culprit here is the need to jump through dies to access memory while that access is timing-critical (audio).

So I get that conceptually it might look appealing to double the core count on Ryzen the way it was done on the 2990wx, but I really do wonder if that's going to be all that beneficial if it suffers the same potential drawbacks. I'm assuming btw that that's what the proposition was.

Also, there is of course the chance (?) that future changes alleviate this potential bottleneck.


Extrapolating from the 2990WX is not the right way to look at it. We have to keep relative memory latency close to what current Ryzen has and roughly halve the per-core bandwidth. Besides, it's almost guaranteed that the new Ryzen will have higher official memory speed support to offset its hungrier per-core bandwidth needs.
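As a rough back-of-the-envelope sketch of that per-core bandwidth point (the memory speeds below are assumptions, not confirmed specs):

Code:
# Theoretical per-core DRAM bandwidth: channels * MT/s * 8 bytes per transfer, divided by cores
def gb_per_s_per_core(channels, mt_per_s, cores):
    return channels * mt_per_s * 8 / 1000 / cores

print(gb_per_s_per_core(2, 2933, 8))    # current 8-core Ryzen, DDR4-2933 (assumed): ~5.9 GB/s per core
print(gb_per_s_per_core(8, 3200, 64))   # hypothetical 64-core Rome, DDR4-3200 (assumed): ~3.2 GB/s per core
print(gb_per_s_per_core(4, 2933, 32))   # 2990WX for comparison: ~2.9 GB/s per core

With those assumptions the 64-core part lands at roughly half of current Ryzen's per-core bandwidth, which is why faster official memory support would help.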
Another important part of the whole thing is the cache system. We don't yet know how it has changed, if at all, compared to the first Ryzen iteration.

Besides, we will be getting 8C APUs based on Zen, judging by what the future consoles are shaping up to be.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Ok, so based on the 8+1 chiplet rumour, we have these hypotheses for "Rome":
  1. Each 7nm CPU chiplet will have no memory controllers. It only contains two CCXs (8 cores and cache), while all uncore logic, including memory controllers, will be placed on a separate 12nm System Controller chiplet.
  2. Same as (1), but the CPU chiplet contains 1 or 2 memory controllers.
  3. Same as (2), but the CPU chiplet is a full SoC (like current "Zeppelin"), containing all the necessary uncore. The System Controller chip is primarily for chiplet interconnect and possibly L4, cache-coherency logic and socket interconnect (for scaling to 4 sockets).
Then, depending on these hypotheses for "Rome", we have the following hypotheses for Ryzen 3000 ("Matisse"):
  1. Has its own 8-core monolithic 7nm SoC die with two CCXs, and likely an iGPU as well.
  2. Reuses the "Rome" CPU chiplet, with a separate dedicated System Controller chiplet.
  3. Same as (2), but with no need for a System Controller chiplet, since each CPU chiplet is a full SoC.
  4. (1), (2) or (3), but with two CPU chiplets, for a 16-core MCM solution.
Here I guess (1) or (3) are most likely, to avoid MCM packaging in the mainstream. My bet is (1), a dedicated APU die. This solution allows the "Rome" CPU chiplet to be very small, while the 8-core APU die matches Intel features in the client segment. On the other hand, (3) would be very much like the situation today, with maximum die reuse between server and client products. Regarding (4), I think more than 8 cores for AM4 is doubtful, based on bandwidth and competitive reasons (Intel isn't moving beyond 8 cores any time soon).

There is a further variant of the hypotheses above, with the two 4-core CCXs replaced by an 8-core CCX, but I find that unlikely, based on architecture and latest rumours.

Cannot wait to find out more — AMD's Next Horizon event on November 6 is going to be exciting!
 
Last edited:

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
Memory bandwidth needs for ordinary tasks are IMO quite overrated. I mean, Ryzen has been tested with single-channel RAM time and again. Obviously it costs performance, but nowhere near enough to make a 16-core CPU unworkable on AM4, especially if it supports, say, DDR4 up to 4000 MHz.

1 Single-Channel stick @ 3200 MHz is almost as fast in gaming as Dual-Channel running @ 2133 MHz. That's true even in games that can utilize all the cores (like BF1). Obviously far from ideal, but nowhere near a disaster:

Single- vs Dual-Channel
(video here, with other results as well)
CPU: R7-1700 @ 3.75 GHz
RAM: 2133 MHz - 14-14-14-34 & 3200 MHz - 16-17-17-35


Most software rarely utilizes more than 8 cores, so the per-core bandwidth for those workflows would be pretty much the same as now. For those that do, the Threadripper WX models have IMO already shown us that the bandwidth of AM4 could be enough for a monolithic 16-core processor (especially with somewhat faster memory).

Now, if we're talking about a two-die 16-core AM4 product, then yes, obviously 1 channel per die won't cut it with the current architecture (MCM). It would have similar limitations to Threadripper 1, which are unacceptable for the mainstream platform.

The only way a two-die system would work is if it's designed like the rumored Epyc 2: with an interposer and a third chiplet handling the uncore and balancing memory between the two 8-core chiplets. That could keep the cross-chip latency in check and, if needed, assign all the channels to one chip.

Now I doubt that AMD goes that route. 7nm is already expensive, and the production costs for such a chip would be high. It would probably sell closer to $1000 than $500, which might be OK for a halo product, but considering it has very limited appeal outside the mainstream desktop (no server or mobile product would want such a processor), I don't think AMD will go down that route.
 

DigDog

Lifer
Jun 3, 2011
13,613
2,186
126
I personally don't think AMD is likely to win those deals. Intel is really entrenched, and can add features that AMD can't as easily (LTE/5G modem). I think that is part of the reason Intel is developing their own GPU too (and pretty clearly would be planning on slotting their own into packaging like that one they did with AMD). I also think Apple is working on their own SoC (which won't happen really soon, but it means that Apple isn't interested in rocking the boat too much; slapping AMD in there, while it shouldn't be a huge problem, would still require a pretty healthy amount of tweaking, and I believe Intel still has an edge at that power level, largely because Apple has tailored so many software tweaks to take full advantage of Intel's power management setup).

Then again, maybe 7nm will give them enough of an advantage that they could win that, as Zen 2 with Navi should be a solid performer and a good fit for the MacBook Pros. And it could be 2-3 years before Intel has a GPU product to slap in there, and if their 10nm keeps being problematic, that would give AMD an edge. Something else that shouldn't be ignored: AMD and Apple would be on the same process, so Apple could integrate their own stuff into the AMD stuff fairly easily. I'm still doubtful, but it presents an interesting opportunity/situation.

I feel that if AMD is going to win some of Apple's CPU deals, it's probably going to be in the Mac Pro stuff, where they can put in Threadripper (or maybe even EPYC, or possibly a custom chip, like a large interposer with a CPU, a GPU and a bunch of HBM in between, where they give you the option of how many of each you want).
I'm just quoting your first sentence.

Companies do not sell or not sell based on the quality of the product; sure, quality helps, but if that were the only metric, companies like BT would do 0 sales.
 

Gideon

Golden Member
Nov 27, 2007
1,703
3,912
136
We will know more in a few weeks, but I'm leaning toward the following segmentation for AMD 7nm:

Epyc 2 and Threadripper 3 seem pretty clear-cut:

Epyc: 8x 8-core 7nm chiplets with a 14nm uncore chiplet in a UMA configuration on an (active-?) interposer, 8 channels of RAM
Threadripper WX: 4x 8-core chiplets with the same uncore chiplet (half-disabled), 4 channels of RAM
Threadripper X: 2x 8-core chiplets (possibly with 2 "dummy dies") with 1x uncore chiplet, 4 channels of RAM

Considering the cost of those systems, disabling half of the 14nm uncore chip's channels for Threadripper is an easy decision to make versus the cost of a separate chip.

Beyond that it gets a bit dicey. While AMD could use the same 8-core chiplet on AM4, that would require a new, smaller uncore chip (for 2 channels) and an interposer. Furthermore, that setup would have very limited use outside the mainstream desktop (perhaps some embedded devices?).

Therefore I'm more inclined to believe that Matisse is actually a monolithic 8-core CPU with a GPU. For Intel, an equivalent Coffee Lake chip (8 cores + GPU) on 14nm is only about 177 mm², so it should be very doable for AMD on 7nm.

An integrated GPU would bring quite a few advantages (even for high-end desktop):
1. An integrated GPU would help close the gap in some Adobe encoding workloads that use QuickSync for some paths, in addition to cores and CUDA.
2. There are still quite a few office desktops (for programmers, scientists, etc.) that don't really need a GPU beyond display output. This has been one key selling point for Coffee Lake so far (no need for a separate GPU).

Not to mention the most obvious one: it would allow AMD to use the same chip for mobile. I can't find it, but I clearly remember someone from AMD stating in an interview that 7nm, with its 2x perf/watt improvement, would allow 8-core mobile chips even down to 15W (though at that level, clock speeds would clearly suffer). In the 45-25W range they would run very well.

Therefore Matisse could be used for:
8-4 core desktops in 95-35W TDP range
8-6 core mobile workstations @ 45-35W (high-end coffee-lake competitors)
6-4 core ultrabooks @ 25-15W

For any budget parts, AMD could continue to use the 12nm Picasso.

IMO such a segmentation just makes the most sense. AMD would pull this off with only two different 7nm chips and a 14nm uncore chip. Any other segmentation would require quite a few more.

It would also play to the 7nm process's strengths. Matisse probably won't out-clock Intel at the very high end anyway, but such a chip would destroy any 6+ core Coffee Lake mobile processor in performance per watt, especially in heavy workloads (it will start to throttle much later, etc.).

EDIT:
There also are some drawbacks of course. The main one that might be a deal-breaker is timing:
1. The 7nm GPU would have to be Vega, as Navi probably wouldn't be ready in the rumored timeframe.
2. Raven Ridge is almost as big as Pinnacle Ridge die-size-wise, so the GPU probably couldn't be all that big. Even 11 CUs would be overkill (and a waste of silicon) for the majority of tasks, but going much lower than that would be a prestige hit, as the 7nm GPU would be slower than the 14nm one. I guess 8 CUs with higher clock speeds would be possible (but again highly disappointing to the integrated-gaming crowd).
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,785
11,128
136
Think 2990wx. I have 4 channels supporting 32 cores. Does pretty well. Clocked at 3700. This will be the same density as 64 cores on 8 channels. And essentially same socket.

That's not really analogous though. Not only are you dealing with more cores, but you have additional IF links adding latency to any memory access from the two RAM-isolated dice. Additionally, anyone using a 2990WX is probably choosing it for specific workloads that are not going to suffer terribly from the aforementioned latency/memory access problems.

We are basically looking at the possibility of running an 8c/16t AM4 chip with only one memory channel.

1 Single-Channel stick @ 3200 MHz is almost as fast in gaming as Dual-Channel running @ 2133 MHz.

That's hardly a ringing endorsement. DDR4-2133 performance on Ryzen/Ryzen+ can be pretty frustrating. Maybe not "bad" since it still kills anything AMD released before it, but . . . definitely frustrating.

I would expect 16c/32t Zen2 to outperform 8c/16t Zen2 on AM4, but there are some circumstances where bandwidth starvation will be an issue.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
anyone using a 2990WX is probably choosing it for specific workloads that are not going to suffer terribly from the aforementioned latency/memory access problems.

Well, yes. Of course those buying pricier CPUs will buy whatever works for their specific workloads. I'll just reiterate that the 2990wx would have been killing it in the pro-audio space had it not been for what appear to be latency issues related to the architecture. I think the same is true for Threadripper overall, although the lack of Thunderbolt support doesn't help either.

Point being that in some industries, themselves a subset of an already narrow subset of CPU buyers, these CPUs are getting fewer sales than one might hope or expect.

Memory bandwidth needs for ordinary tasks are IMO quite overrated. I mean, Ryzen has been tested with single channel ram time and again. Obviously it costs performance, but nowhere near enough , to make a 16 core CPU unworkable on AM4, especially if it supports, say, up to 4000 MHz of DDR4.
I think one thing to remember is market segmentation. So the question to me is what segment AM4 occupies when competing with Intel. If AMD were to produce an AM4 CPU with more than 8 cores, what segment of the market would that occupy?

On the Intel side, content creators actually do begin their builds at around the 8700K/9900K level and then move up to "HEDT" from there. So I'd wonder whether, for example, a 16-core part would be taken advantage of enough in gaming to justify its purchase, while simultaneously being hampered enough by memory performance (not necessarily just bandwidth) that it couldn't compete in productivity (probably content creation).
 

jpiniero

Lifer
Oct 1, 2010
14,823
5,440
136
So the question to me is what segment AM4 occupies when competing with Intel. If AMD were to produce an AM4 CPU with more than 8 cores, what segment of the market would that occupy?

The market right now is a tad irrational. Just for marketing purposes, a 2-die 12-core R9 might make sense.
 
Mar 11, 2004
23,155
5,623
146
I'm just quoting your first sentence.

Companies do not sell or not sell based on the quality of the product; sure, quality helps, but if that were the only metric, companies like BT would do 0 sales.

Am...I missing something? You're just quoting my first sentence, but you quoted the entirety of my post? Did you mean responding to? Which, with your second sentence, makes no more sense than if you were responding to the entirety of my post, which is to say it makes no sense. I have absolutely no clue how your post is even a response to mine.
 

DigDog

Lifer
Jun 3, 2011
13,613
2,186
126
Am...I missing something? You're just quoting my first sentence, but you quoted the entirety of my post? Did you mean responding to? Which, with your second sentence, makes no more sense than if you were responding to the entirety of my post, which is to say it makes no sense. I have absolutely no clue how your post is even a response to mine.
Whether AMD will win supply deals is only partly based on the quality of product they can supply - the same applies to Intel.
Sales are dependent on finance options, personal politics, marketing campaigns, added-value products, pre-existing tie-in deals, associated products, and plain ol' branding and lies - not "tech" lies, just lies.
Consider that they actually sold a great deal of Bulldozer chips, when no (non-partisan) enthusiast would buy one. Why?
Because they have salespeople who know the art of selling.
They could sell sand to an Arab; a fridge to an Eskimo.
The quality of the product is *not* the primary decision factor.
 
Last edited:
Reactions: TheGiant

kokhua

Member
Sep 27, 2018
86
47
91
Sorry guys, somehow I can't embed the images and make them show up inside the post. If anybody wishes to, please help me do that and delete the old posts. Thx.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Sorry guys, somehow I can't embed the images and make them show up inside the post. If anybody wishes to, please help me do that and delete the old posts. Thx.

To add to @PeterScott's helpful advice:

You can edit your posts. You can also preview your post before posting, to make sure it displays correctly before you commit (press the "More options..." button). While you cannot delete a post, you can edit it and remove all contents.
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
I had thought, to implement this successfully, the Local Infinity Fabric [LIF - within the package] would need to run at a different clock speed to Global Infinity Fabric [GIF - to System RAM].

But you are postulating an architecture where the L4 cache is a duplicate of all the L3 cache (and at 512MB, a bit more), so at most there is one 'hop' to get any on-socket data - which wouldn't need absurdly high LIF speeds clocking far in excess of GIF speeds.

Zen's IF was not run asynchronously to system RAM because apparently synchronising the traffic would have been too troublesome. If the system controller chip operated as a buffer then I suppose it may ease that challenge somewhat.
 
Reactions: Gideon and Vattila

kokhua

Member
Sep 27, 2018
86
47
91
I had thought, to implement this successfully, the Local Infinity Fabric [LIF - within the package] would need to run at a different clock speed to Global Infinity Fabric [GIF - to System RAM].

Zen's IF was not run asynchronously to system RAM because apparently synchronising the traffic would have been too troublesome. If the system controller chip operated as a buffer then I suppose it may ease that challenge somewhat.

Referring to the "AMD-style" version of the block diagram, you can see that I simply separated the CCX from the IF SDF Plane and put it on a separate die. The interface between the CCX and the SDF Plane remains the same as in Naples. While not explicitly stated, I imagine it will run at MEMCLK, as in Naples.

One criticism of this architecture is that separating the CCX from the memory controllers adds latency, or, as some say, an "additional IF hop". But as you can see, it is identical to Naples except that the CCX is on a separate die. The added latency comes from the drivers and wire delays introduced by the CPU-SC link. But the connections between the CPU die and the System Controller die are very, very short and direct (2-3mm max); the additional latency can't be more than just a few ns.
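A quick sanity check on that few-ns claim (just a sketch; the propagation speed and the PHY overhead are assumed values):

Code:
# Wire delay over a ~3 mm CPU-die-to-System-Controller link, assuming ~0.5c propagation
# speed in the package substrate (assumed value).
c = 3.0e8                     # speed of light, m/s
trace_m = 3e-3                # 3 mm, per the estimate above
wire_delay_ps = trace_m / (0.5 * c) * 1e12
print(wire_delay_ps)          # ~20 ps of pure wire delay

# Even allowing a few extra cycles of PHY/driver overhead at an assumed 1.6 GHz MEMCLK,
# the total stays in the low single-digit nanoseconds.
extra_cycles = 4
print(wire_delay_ps / 1000 + extra_cycles / 1.6)   # ~2.5 ns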


But you are postulating an architecture where the L4 cache is a duplicate of all the L3 cache (and at 512MB, a bit more), so at most there is one 'hop' to get any on-socket data - which wouldn't need absurdly high LIF speeds clocking far in excess of GIF speeds.

I don't understand what you mean by "L4 cache is a duplicate of all L3 cache". As I indicated in the diagram, I expect the L3 cache to be doubled to 4MB/core (256MB total), mainly to mitigate the memory latency introduced by disintegration, and the bandwidth limitations of 8ch DDR4 in case bandwidth compression is not used. For an L4 cache to be effective, it would have to be at least 512MB. After estimating the die size, I concluded that it is not practical for a 12/14nm System Controller; the die would be too large. There's also the issue of power consumption. Maybe that will be possible in future, when the SC also moves to 7nm, idk. In any case, it is unnecessary to beat Cascade Lake and Cooper Lake.
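To show why the die size works out that way, here is a rough area estimate for a 512MB L4 on a 14nm-class process (the bit-cell size and array efficiency are assumed, order-of-magnitude figures only):

Code:
# Rough SRAM area for a 512 MB L4 on a 14nm-class process.
bits = 512 * 8 * 2**20        # 512 MB expressed in bits (~4.3e9)
bitcell_um2 = 0.08            # assumed 14nm 6T SRAM bit cell area, um^2
array_efficiency = 0.6        # assumed share of macro area that is actual bit cells

l4_area_mm2 = bits * bitcell_um2 / array_efficiency / 1e6
print(l4_area_mm2)            # roughly 570 mm^2 for the cache arrays alone

That is before any memory controllers, IF links or I/O on the same die, so the die-size objection above seems sound.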
 

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
I don't understand what you mean by "L4 cache is a duplicate of all L3 cache".
All cores share the L3$; this is why currently all dies need to have direct connections. With your SC, cores would now have to do two hops for every L3$ access to every other die. This would be an enormous amount of constant traffic through the SC, which is easily avoided by always keeping a copy of all dies' L3$ data in the L4$ (with the added benefit that the die-SC link itself no longer needs to scale with the number of dies interfaced with the SC).
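A toy latency model of the hop-counting in this argument (all figures below are assumed and purely illustrative):

Code:
# Cross-die L3$ access routed through the SC vs. a mirrored L4$ sitting on the SC itself.
die_to_sc_ns = 15      # one traversal of a die<->SC link (assumed)
remote_l3_ns = 12      # lookup in the remote die's L3$ (assumed)
l4_on_sc_ns = 25       # lookup in a hypothetical L4$ on the SC (assumed)

two_hop_ns = 4 * die_to_sc_ns + remote_l3_ns   # request and reply each cross a die<->SC link twice
l4_hit_ns = 2 * die_to_sc_ns + l4_on_sc_ns     # request reaches the SC and the mirrored L4$ answers
print(two_hop_ns, l4_hit_ns)                   # 72 ns vs 55 ns with these assumed numbers

On top of the latency, the mirrored L4$ would also keep that traffic off the other dies' links entirely, which is the scaling point above.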
 

kokhua

Member
Sep 27, 2018
86
47
91
All cores share the L3$; this is why currently all dies need to have direct connections. With your SC, cores would now have to do two hops for every L3$ access to every other die. This would be an enormous amount of constant traffic through the SC, which is easily avoided by always keeping a copy of all dies' L3$ data in the L4$ (with the added benefit that the die-SC link itself no longer needs to scale with the number of dies interfaced with the SC).

I think this is incorrect; the L3$ is only shared by cores within the same CCX.
 

Abwx

Lifer
Apr 2, 2011
11,161
3,858
136
But the connections between the CPU die and the System Controller die are very, very short and direct (2-3mm max); the additional latency can't be more than just a few ns.

That may be short to the eye, but electrically it's extremely long: 2-3mm amounts to a 1/4 wavelength at 30GHz and a 1/2 wavelength at 60GHz, which is well within the spectral range of a 3GHz square clock signal. That means such a wire is incapable of transmitting the higher harmonics of the signal, which are the frequency components that make the slopes of the rising and falling edges abrupt.

I won't even talk about a 60Gbit SerDes, since the signal wouldn't even appear at the terminals of the receiver: all the energy would be radiated away by the conductors, which would act as a perfectly tuned antenna.
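For reference, the wavelength arithmetic behind those figures (free-space values; in the package dielectric the wavelengths are shorter still):

Code:
# Free-space wavelength vs. a 2-3 mm trace: lambda = c / f
c = 3.0e8   # m/s

def wavelength_mm(freq_hz):
    return c / freq_hz * 1e3

print(wavelength_mm(30e9) / 4)   # quarter wave at 30 GHz: 2.5 mm
print(wavelength_mm(60e9) / 2)   # half wave at 60 GHz:    2.5 mm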
 