Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
AMD could easily double their CPU sales with the release of a dual-socket X690E mobo and quad-channel 256 GB RAM support. Intel HEDT would die tragically in the womb.
Current Threadripper PRO CPUs (which are rebranded EPYC in every way except higher clocks) can work on dual EPYC boards. So a dual 7955WX setup is a very real possibility, or even better, higher-core-count parts like dual 64/96-core TR PRO CPUs.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
I have seen some things saying that the Infinity Cache chiplets that go under the base die also include memory controllers. That would make a lot of sense, since it decouples the memory controllers from the base die. They could possibly use the same base die across all of their products.
You kinda lost me here. I don't think a base die makes sense just for high speed IO. Maybe if you fill the middle with cache or something else. Would be somewhat interesting from a packaging standpoint, however. Marrying two large dies vs one large die to a bunch of smaller dies.

But I think in the short term, the logical way to split the IO die would be to leave pretty much everything the same (including organic links to the compute chiplets), but split the IO die evenly into two halves, and connect them with an embedded bridge. Should be fairly low overhead.
 
Reactions: Tlh97 and Joe NYC

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
The problem with just cutting the IOD in half or something similar is that you'd have to ensure the same bandwidth and latency to avoid being worse off than with a comparable monolithic die (Intel's approach with SPR). This tends to make going with chiplets both complex and costly. AMD's approach so far has been to look for places where the resulting link bottlenecks will be localized and not affect global performance, hence CCDs in CPUs and now MCDs in GPUs. This significantly reduces the bandwidth requirement compared to e.g. cutting a ring bus within the IOD in half.

I could imagine that splitting off some I/O into chiplets, similar to what has been done with MCDs, could be technically feasible. But since IODs are already dominated by the I/O interfaces (which dictate their size) anyway, there may be no savings in doing that.
 

desrever

Member
Nov 6, 2021
122
302
106
The problem with just cutting the IOD in half or something similar is that you'd have to ensure the same bandwidth and latency to avoid being worse off than with a comparable monolithic die (Intel's approach with SPR). This tends to make going with chiplets both complex and costly. AMD's approach so far has been to look for places where the resulting link bottlenecks will be localized and not affect global performance, hence CCDs in CPUs and now MCDs in GPUs. This significantly reduces the bandwidth requirement compared to e.g. cutting a ring bus within the IOD in half.

I could imagine that splitting off some I/O into chiplets, similar to what has been done with MCDs, could be technically feasible. But since IODs are already dominated by the I/O interfaces (which dictate their size) anyway, there may be no savings in doing that.
There are places to move things off the IO die for sure; PCIe and DDR5 can be split out. Currently the monolithic IO die works well enough and makes economic sense, but there are definitely ways to split it. Meteor Lake has separate IO and SoC tiles. Expect this to be quite doable on AMD's side as well.

I expect they may eventually move toward an MCD-style design as well. It allows for an easy way to tier their product stack: 4/8/12/16 memory channels by using more MCDs would allow fine-grained control for HEDT and servers.
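As a back-of-the-envelope illustration of that tiering (my numbers, not AMD's; the DDR5 speed is an assumption):

```python
# Hypothetical sketch: theoretical peak DRAM bandwidth per product tier,
# assuming each tier simply adds more memory channels (e.g. via more MCD-like dies).
DDR5_MT_S = 5200          # assumed DDR5 transfer rate in MT/s
BYTES_PER_TRANSFER = 8    # one 64-bit channel moves 8 bytes per transfer

for channels in (4, 8, 12, 16):
    peak_gb_s = channels * DDR5_MT_S * BYTES_PER_TRANSFER / 1000
    print(f"{channels:2d} channels: ~{peak_gb_s:.0f} GB/s theoretical peak")
```

Each step up the stack would roughly add another ~166 GB/s of theoretical headroom at that speed, which is a pretty clean way to segment HEDT from the big server parts.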
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
The problem with just cutting the IOD in half or something similar is that you'd have to ensure the same bandwidth and latency to avoid being worse off than with a comparable monolithic die (Intel's approach with SPR). This tends to make going with chiplets both complex and costly. AMD's approach so far has been to look for places where the resulting link bottlenecks will be localized and not affect global performance, hence CCDs in CPUs and now MCDs in GPUs. This significantly reduces the bandwidth requirement compared to e.g. cutting a ring bus within the IOD in half.

I could imagine that splitting off some I/O into chiplets, similar to what has been done with MCDs, could be technically feasible. But since IODs are already dominated by the I/O interfaces (which dictate their size) anyway, there may be no savings in doing that.
AMD already employs a NUMA structure internally, and many workloads can benefit significantly from enabling NUMA-aware settings in the BIOS. Dell describes some recommendations and benchmarks here: https://infohub.delltechnologies.com/p/amd-milan-bios-characterization-for-hpc/

So all that you'd be doing by splitting the IO die is adding a couple extra cycles of latency between some of the domains, which really isn't that bad. And an embedded bridge should be plenty sufficient bandwidth-wise.
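To make the NUMA-aware part concrete, here's a minimal sketch of pinning a process to one node's cores on Linux (the core-to-node mapping below is made up; the real one comes from numactl --hardware or sysfs):

```python
import os

# Hypothetical core list for NUMA node 0 on a Linux box; the real core-to-node
# mapping comes from /sys/devices/system/node/ or `numactl --hardware`.
NODE0_CPUS = set(range(0, 8))

# Pin this process to node 0's cores so that first-touch allocations land in
# node 0's local memory and traffic stays off the cross-domain links.
os.sched_setaffinity(0, NODE0_CPUS)
print("Running on CPUs:", sorted(os.sched_getaffinity(0)))
```

A split IO die would just add another such domain boundary, not a fundamentally new problem.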
 

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,449
136
There might be more reason to split an IO die if it was only doing IO and different market segments only differed based on need for a scaling amount of IO.

Now that AMD has put graphical capabilities on their desktop Zen CPUs, a split die isn't going to happen. It would require too much duplication of resources that a server IO die has no need for at all.

Furthermore, it's not so simple even within the server market. There is a niche where someone doesn't care all that much about core count, but does want the maximum number of PCIe lanes or memory channels.

Designing and building a split die isn't going to be as economical as one might at first assume. The requirements between desktop and server have diverged enough that you need two separate designs, and server isn't as simple as matching the number of cores to the amount of IO either.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
There might be more reason to split an IO die if it was only doing IO and different market segments only differed based on need for a scaling amount of IO.

Now that AMD has put graphical capabilities on their desktop Zen CPUs, a split die isn't going to happen. It would require too much duplication of resources that a server IO die has no need for at all.

Furthermore, it's not so simple even within the server market. There is a niche where someone doesn't care all that much about core count, but does want the maximum number of PCIe lanes or memory channels.

Designing and building a split die isn't going to be as economical as one might at first assume. The requirements between desktop and server have diverged enough that you need two separate designs, and server isn't as simple as matching the number of cores to the amount of IO either.
Agreed that a split IO die for desktop or sharing an IO die between desktop and server definitely isn't in the cards. For users who just want IO and not core count, however, I don't see what the issue is. Just like today, they'd buy the full platform (in this theoretical case, w/ two IO dies), and just have fewer CCXs attached. I'm proposing it simply as a possible alternative to needing a different die between the low and high core count server chips (currently, Genoa vs Siena).
 

scineram

Senior member
Nov 1, 2020
361
283
106
AMD already employs a NUMA structure internally, and many workloads can benefit significantly from enabling NUMA-aware settings in the BIOS. Dell describes some recommendations and benchmarks here: https://infohub.delltechnologies.com/p/amd-milan-bios-characterization-for-hpc/

So all that you'd be doing by splitting the IO die is adding a couple extra cycles of latency between some of the domains, which really isn't that bad. And an embedded bridge should be plenty sufficient bandwidth-wise.
So they should make the whole thing worse for almost no benefit. Ok.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
So they should make the whole thing worse for almost no benefit. Ok.
There's a very clear benefit of only needing a single die for both markets, and potential yield/cost benefits for having a smaller die. Whether that would be worth the PnP overhead to the main server line, I didn't address.
 

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
And an embedded bridge should be plenty sufficient bandwidth-wise.
Well yes, but for whatever reason even on AM5 and SP5 AMD still relies on tried and true substrate MCM. Maybe embedded bridges come with the Zen 5 IOD or possibly the Zen 4c IOD already, maybe much later, we'll see.

There might be more reason to split an IO die if it was only doing IO and different market segments only differed based on need for a scaling amount of IO.
So segmentation based on I/O capability? AMD turned that around and made it one of its (especially server) platforms' core features that I/O capability is pretty much the same regardless of core count. And since that naturally comes at a cost, SP6 with Siena is coming.
 
Reactions: Tlh97 and Joe NYC

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
Well yes, but for whatever reason even on AM5 and SP5 AMD still relies on tried and true substrate MCM. Maybe embedded bridges come with the Zen 5 IOD or possibly the Zen 4c IOD already, maybe much later, we'll see.
I don't think the two are necessarily contradictory. The bandwidth needs between two IO dies would be greater than from an IO die to an individual CCX. Though I wish they'd discuss their reasoning behind sticking with organic. Someone at AMD must have done the math.
 

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
I don't think the two are necessarily contradictory. The bandwidth needs between two IO dies would be greater than from an IO die to an individual CCX. Though I wish they'd discuss their reasoning behind sticking with organic. Someone at AMD must have done the math.
It's likely part of a plan of steady, evolutionary spec improvements, one aspect being that even in Zen 4 the load/store bandwidth is low enough that higher DDR5 speeds end up making no performance difference.

The big question is what drove this decision when it was made, roughly three years ago, looking ahead to the present: organic substrate MCM still being the only feasible mainstream packaging solution? High-speed DDR5 not existing yet or being too expensive? Segmentation between generations, with Zen 4 taking after Zen 3 and only Zen 5 overhauling the load/store path?
 

jamescox

Senior member
Nov 11, 2009
642
1,104
136
There might be more reason to split an IO die if it was only doing IO and different market segments only differed based on need for a scaling amount of IO.

Now that AMD has put graphical capabilities on their desktop Zen CPUs, a split die isn't going to happen. It would require too much duplication of resources that a server IO die has no need for at all.

Furthermore, it's not so simple even within the server market. There is a niche where someone doesn't care all that much about core count, but does want the maximum number of PCIe lanes or memory channels.

Designing and building a split die isn't going to be as economical as one might at first assume. The requirements between desktop and server have diverged enough that you need two separate designs, and server isn't as simple as matching the number of cores to the amount of IO either.
I am not sure that is true. The consumer IO die is around 125 mm²; the Genoa IO die is close to 400 mm², even with extra switches and connectivity. The consumer IO die contains things not needed by servers, but that is easy to get around by stacking another chiplet on top (like a GPU chiplet) or embedding another die underneath with the needed functionality. If they use embedded dies with micro-solder-ball-style stacking, then the embedded die can be made anywhere, like GlobalFoundries. SoIC stacking would require both dies to be made at TSMC. Stacking tech could allow for a very general base die, since an embedded die underneath can be used to attach to any type of memory, making the base die independent of memory type. Just use a different embedded bridge die with a common interface to the top die.

You may be partially right; I don't know if I would expect to see something like this in most consumer parts. Most consumer parts should really be APUs anyway. Maybe a high-end consumer part with more than one stack, but I am not sure where that would fit in the market. For all but the highest-end consumer parts, I have wondered if they could basically use an APU with GMI link(s) or just a bridge die. Embed some low-power Zen 4-based cores in an APU (basically an IO die) and then connect a Zen 5 chiplet for when more power is needed. That would make an excellent laptop chip and could possibly cover most of the consumer product stack.

Splitting the server IO die would likely not be difficult. There are internal connectivity diagrams for it showing internal switches and such. It is already split into 4 quadrants with different latencies between them, and it has NUMA-nodes-per-socket (NPS) settings to take advantage of this: it can be set to NPS1, NPS2, or NPS4. The NPS setting also changes the memory interleave. NPS1 interleaves across all 8 channels, NPS2 across the 4 channels in each half, and NPS4 just interleaves across the 2 channels in each quadrant. These settings trade maximum bandwidth against lower latency, but require the application to be NUMA-aware. This seems like it would be very easy to split into separate dies; it has never really been monolithic.
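A toy model of what those settings do to the channel interleave (the granularity and channel numbering here are made up purely for illustration; the real address map is more involved):

```python
# Toy model of how NPS settings change memory interleaving on an 8-channel EPYC
# (interleave granularity and channel ordering are assumptions, not the real map).
INTERLEAVE_BYTES = 256          # assumed interleave granularity
CHANNELS = 8                    # 2 channels per quadrant, 4 quadrants

def channel_for(addr: int, nps: int) -> int:
    """Return the DRAM channel (within its NUMA domain) an address maps to under NPS1/2/4."""
    channels_per_domain = CHANNELS // nps       # 8, 4, or 2 channels interleaved together
    block = addr // INTERLEAVE_BYTES
    return block % channels_per_domain

for nps in (1, 2, 4):
    pattern = [channel_for(a, nps) for a in range(0, 8 * INTERLEAVE_BYTES, INTERLEAVE_BYTES)]
    print(f"NPS{nps}: consecutive blocks hit channels {pattern}")
```

Consecutive blocks spread across 8, 4, or 2 channels respectively, which is exactly the bandwidth-versus-locality trade-off described above.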

One other thing I have been thinking about is that Zen 5 will likely have massively increased FP power, so they will likely be adding HBM to HPC processors in addition to mixing CPU and GPU chiplets in the same package. This seems to imply that the CPU chiplets will need to be stacked somehow; I don't know if it could be embedded dies and/or GMI for everything. If you think about the layout of Genoa, with a centralized IO die, where would you put HBM? You want the HBM as close to the CPU or GPU cores as possible, not limited by a GMI link. How do you also scale it up to at least 8 compute chiplets? This makes me believe that the CPU cores may use a similar, if not the same, set-up as the GPUs. A base die with the CPU (or GPU or FPGA or whatever accelerator) stacked on top, and then bridge chips to HBM or system memory, would allow memory access without going through a GMI link.

The diagrams I have seen were showing that the next-gen GPUs would be two elements, possibly stacks with embedded dies and/or SoIC stacked on top. They are connected together with very high bandwidth, likely an embedded bridge. They have HBM along one side, and the other side is used to connect to another dual-GPU module. Then two such sets can be connected together to make an 8-GPU-chiplet device. This is why I was pointing at the diagram for the Crusher system here:


This Crusher system seems very similar to upcoming systems that will mix CPUs and GPUs. It shows a 200 GB/s link between adjacent GPUs and 50 GB/s links for "remote" GPUs. It may actually be a test platform to some extent. The adjacent devices may move to a silicon bridge connection. The MI250X appears to have 4x high-speed GPU-GPU links (200 GB/s total) to the adjacent GPU, 3 GPU-GPU links (50 GB/s each) for other GPUs or the network (3 GPU, or 2 GPU + 1 network), and one CPU link (36 GB/s) for the connection to the CPU (off package). That adds up to 7x 50 GB/s links and 1x 36 GB/s link per GPU. That is a bit of die area, so moving it to a stacked die seems like a good idea. Also, they would want to use the same chiplets everywhere, so putting all of these links on the compute die itself doesn't make that much sense. It would also waste die area on the compute die, which generally uses the latest and greatest (and most expensive) process tech. Not all of the interfaces would be needed on all products, but splitting them out and making them on a cheaper node can be a win, since you are wasting cheap silicon rather than expensive silicon.
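Tallying the per-GPU link budget I described above (these are my readings of the figures, treated as approximations rather than official specs):

```python
# Per-GPU (per-GCD) off-die link budget as described above; figures are this
# post's approximations, not official specs.
links = {
    "adjacent-GPU links (4 x 50 GB/s)": 4 * 50,
    "remote GPU / network links (3 x 50 GB/s)": 3 * 50,
    "CPU link (1 x 36 GB/s)": 36,
}

total = sum(links.values())
for name, bw in links.items():
    print(f"{name}: {bw} GB/s")
print(f"Total off-die bandwidth per GPU: ~{total} GB/s across 8 links")
```

Nearly 400 GB/s of off-die links per GPU is a lot of PHY to carry on an expensive compute die, which is the whole argument for pushing it down into cheaper stacked silicon.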

It gets very difficult to speculate once 2.5D and 3D stacking become common, since there are a lot of possibilities. This is more Zen 5 speculation than Zen 4. Although, we don't seem to know exactly what Bergamo or Siena will actually be at this point. I was hoping for stacking with Bergamo, but it seems very unlikely to be anything other than a normal Genoa IO die. It is also still hard to tell whether Siena will be salvage dies only or a new IO die layout. With 64 cores, they have to do something more complicated than just half of a Genoa IO die. The 4 quadrants are somewhat independent, but 2 quadrants would only be able to support 6 chiplets, not 8. That means a completely new layout with a lot of units removed would be necessary.
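Rough arithmetic behind that last point (assuming 3 CCD links per quadrant on the Genoa IOD and 8 cores per Zen 4 CCD, which is how I'm counting):

```python
# Why half a Genoa IOD looks too small for a 64-core Siena
# (assumes 3 CCD links per quadrant and 8 cores per Zen 4 CCD).
LINKS_PER_QUADRANT = 3
CORES_PER_CCD = 8

for quadrants in (4, 2):
    ccds = quadrants * LINKS_PER_QUADRANT
    print(f"{quadrants} quadrants -> {ccds} CCDs -> up to {ccds * CORES_PER_CCD} cores")

# Two quadrants top out at 6 CCDs / 48 cores, short of the 8 CCDs a 64-core part would need.
```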
 
Jul 27, 2020
17,916
11,687
116

That would be the monster truck of non-server CPUs.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,331
2,942
106
Well yes, but for whatever reason even on AM5 and SP5 AMD still relies on tried and true substrate MCM. Maybe embedded bridges come with the Zen 5 IOD or possibly the Zen 4c IOD already, maybe much later, we'll see.

It seems that Bergamo and Siena will just reuse Genoa resources; I don't see AMD breaking new ground with chips that CAN reuse existing resources.

I would look for the next innovation to come from the MI300 side.

So segmentation based on I/O capability? AMD turned that around and made it one of its (especially server) platforms' core features that I/O capability is pretty much the same regardless of core count. And since that naturally comes at a cost, SP6 with Siena is coming.

I agree with that. There will be enough segmentation with SP5 and SP6.

If anything, maybe AMD could go even lower end, with very low-end servers using the AM5 socket.
 
Reactions: Tlh97 and Vattila

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
I would look for the next innovation to come from the MI300 side.
True, the MI series certainly is where most of the interesting packaging tech development by AMD is happening right now.

If anything, maybe AMD could go even lower end, with very low-end servers using the AM5 socket.
"Server grade" AM5 boards with certified ECC support (not only on-die) would be really nice to have.

That would be the monster truck of non-server CPUs.
Threadripper being based on SP5 ensures its price will never be as low as it was in the first couple gens again.
 
Last edited:

eek2121

Diamond Member
Aug 2, 2005
3,051
4,273
136
Why do you think running your own fab, with all the R&D costs included, is always cheaper than outsourcing to a specialist? That is such a misguided belief, and I see it repeated continually.

Well, maybe all the firms just abandon the client segment. Joking of course, but they will have to drop prices, as we are already seeing. With major companies laying off workers and consumers re-evaluating their wants, the TAM is falling. That's a fact that cannot be dismissed.

Do we still not understand what's happening?

TSMC has a margin of 49%. To simplify (it is more complicated than this), that means that on a $16,000 wafer, TSMC pockets $7,840 of pure profit. Intel doesn't necessarily have to worry about the profitability of its fabs, as long as overall margins are strong. That means that, assuming all other costs are equal (they aren't), Intel can make and sell a similarly sized chip for less than AMD ever could.

Costs do vary, of course, but not by 50%.
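As a rough sketch of how that margin flows through to per-die cost (the die count and yield below are placeholders I picked just for illustration):

```python
# Back-of-the-envelope: how a foundry's margin inflates per-die cost for a fabless customer.
# Dies per wafer and yield are illustrative placeholders, not real figures.
wafer_price    = 16_000     # USD, price charged to the customer
foundry_margin = 0.49       # foundry gross margin
dies_per_wafer = 600        # assumed candidate dies on a 300 mm wafer
yield_rate     = 0.80       # assumed fraction of good dies

foundry_profit = wafer_price * foundry_margin        # ~$7,840 on this wafer
foundry_cost   = wafer_price - foundry_profit        # what the wafer "really" cost to make
good_dies      = dies_per_wafer * yield_rate

print(f"Foundry profit per wafer:        ${foundry_profit:,.0f}")
print(f"Fabless customer cost per die:   ${wafer_price / good_dies:,.2f}")
print(f"Integrated maker's cost per die: ${foundry_cost / good_dies:,.2f}")
```

The gap per die is exactly the foundry's cut, which is the point: an integrated manufacturer only has to beat its own internal cost, not someone else's margin-loaded price.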
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
TSMC has a margin of 49%. To simplify (it is more complicated than this), that means that on a $16,000 wafer, TSMC pockets $7,840 of pure profit. Intel doesn't necessarily have to worry about the profitability of its fabs, as long as overall margins are strong. That means that, assuming all other costs are equal (they aren't), Intel can make and sell a similarly sized chip for less than AMD ever could.

Costs do vary, of course, but not by 50%.
Apples & Oranges.

What node is that TSMC $16,000 wafer? 5nm?
What node is Intel using for most advanced bulk processing? Intel 7?
What are the density differences?
What is the TSMC cost of that equivalent? $9,000 or less (TSMC 6)?
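In other words, the comparison only works if you normalize wafer price by density. A sketch with made-up numbers, just to show the method:

```python
# Compare cost per unit of logic rather than raw wafer price.
# All numbers below are placeholders; only the normalization method matters.
def cost_per_billion_transistors(wafer_usd: float, wafer_area_mm2: float, mtr_per_mm2: float) -> float:
    """Wafer price divided by the billions of transistors that wafer can hold."""
    transistors_billion = wafer_area_mm2 * mtr_per_mm2 / 1000
    return wafer_usd / transistors_billion

WAFER_AREA_MM2 = 70_000  # roughly the area of a 300 mm wafer

# Hypothetical "node A" (denser, pricier wafer) vs "node B" (older, cheaper wafer)
a = cost_per_billion_transistors(16_000, WAFER_AREA_MM2, 130)
b = cost_per_billion_transistors(9_000, WAFER_AREA_MM2, 65)
print(f"Node A: ${a:.2f} per billion transistors")
print(f"Node B: ${b:.2f} per billion transistors")
```

Even with a much pricier wafer, the denser node can come out cheaper per transistor, which is why raw wafer price alone doesn't settle the question.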
 
Reactions: Kaluan

biostud

Lifer
Feb 27, 2003
18,397
4,963
136

eek2121

Diamond Member
Aug 2, 2005
3,051
4,273
136
Apples & Oranges.

What node is that TSMC $16,000 wafer? 5nm?
What node is Intel using for most advanced bulk processing? Intel 7?
What are the density differences?
What is the TSMC cost of that equivalent? $9,000 or less (TSMC 6)?

Absolutely not apples and oranges. The bottom line is that it costs Intel less to make chips than it costs AMD.
 
Reactions: scineram

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136

Competition is good, looking forward to CES announcements.

Actually, prices "dropped" down to AMD's announced MSRP...
 

Harry_Wild

Senior member
Dec 14, 2012
841
152
106

Competition is good, looking forward to CES announcements.
I do see price discounts of between $5 and $10 for the 7950X now! Only an idiot would even think of being motivated to buy by such minor discounts!
 
Reactions: Thibsie