There might be more reason to split an IO die if it were only doing IO and the different market segments differed only in how much IO they needed.
Now that AMD has put graphics capabilities on its desktop Zen CPUs, a split die isn't going to happen. It would require duplicating too many resources that a server IO die has no need for at all.
Furthermore, even within the server market it isn't that simple. There is a niche where someone doesn't care all that much about core count but does want the maximum number of PCIe lanes or memory channels.
Designing and building a split die isn't going to be as economical as one might at first assume. The requirements of desktop and server have diverged enough that you need two separate designs, and server isn't as simple as matching the number of cores to the amount of IO either.
I am not sure that is true. The consumer IO die is around 125 mm²; the Genoa IO die is close to 400 mm², even with its extra switches and connectivity. The consumer IO die contains things servers don't need, but that is easy to get around by stacking another chiplet on top (like a GPU chiplet) or embedding another die underneath with the needed functionality. If they use embedded dies with micro-solder-ball style stacking, then the embedded die can be made anywhere, such as at GlobalFoundries; SoIC stacking would require both dies to be made at TSMC. Stacking tech could allow for a very general base die: since the embedded dies underneath can attach to any type of memory, the base die can be independent of memory type. Just use a different embedded bridge die with a common interface to the top die.
You may be partially right; I don’t know if I would expect to see something like this in most consumer parts. Most consumer parts should really be APUs anyway. Maybe a high-end consumer part with more than one stack, but I am not sure where that would fit in the market. For all but the highest-end consumer parts, I have wondered if they could basically use an APU with GMI link(s) or just bridge dies: embed some low-power Zen 4 based cores in an APU (basically an IO die) and then connect a Zen 5 chiplet for when more power is needed. That would make an excellent laptop chip and could possibly cover most of the consumer product stack.
Splitting the server IO die would likely not be difficult. There are internal connectivity diagrams for it showing internal switches and such. It is already split into 4 quadrants with different latencies between them, and it has NUMA nodes per socket (NPS) settings to take advantage of this. It can be set to NPS1, NPS2, or NPS4, and the NPS setting also changes the memory interleave: NPS1 interleaves across all 8 channels, NPS2 across the 4 channels in each half, and NPS4 just interleaves across the 2 channels in each quadrant. These settings trade maximum bandwidth against lower latency, but the lower-latency modes require the application to be NUMA-aware. This seems like it would be very easy to split into separate dies; logically, it has never really been monolithic.
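To illustrate the bandwidth-versus-locality trade-off, here is a minimal Python sketch of how the NPS setting changes which channels a stream of addresses lands on. The 256-byte interleave granule and plain modulo mapping are my assumptions for illustration, not AMD's actual interleave/hash scheme.

```python
# Toy model of NPS memory interleaving. The 256-byte granule and modulo
# mapping are illustrative assumptions, not AMD's real address hashing.

GRANULE = 256          # bytes per interleave unit (assumed)
TOTAL_CHANNELS = 8     # 8-channel socket, 2 channels per IO-die quadrant

def channel_for(offset: int, nps: int) -> int:
    """Channel (within the owning NUMA node) serving a node-local offset.

    NPS1: one node, interleaved across all 8 channels (max bandwidth).
    NPS2: two nodes, 4 channels each.
    NPS4: one node per quadrant, 2 channels each (lowest latency, but the
          application must be NUMA-aware to keep its traffic local).
    """
    channels_per_node = TOTAL_CHANNELS // nps
    return (offset // GRANULE) % channels_per_node

# A 2 KiB streaming access spreads over all 8 channels under NPS1,
# but only over the 2 channels of one quadrant under NPS4.
for nps in (1, 2, 4):
    used = {channel_for(off, nps) for off in range(0, 2048, GRANULE)}
    print(f"NPS{nps}: channels used = {sorted(used)}")
```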
One other thing I have been thinking about is that Zen 5 will likely have massively increased FP power, so they will likely be adding HBM to HPC processors in addition to mixing CPU and GPU chiplets in the same package. This seems to imply that the CPU chiplets will need to be stacked somehow; I don’t know if it could be embedded dies and/or GMI for everything. If you think about the layout of Genoa, with a centralized IO die, where would you put HBM? You want the HBM as close to the CPU or GPU cores as possible, not limited by a GMI link. How do you also scale it up to at least 8 compute chiplets? This makes me believe that the CPU cores may use a similar, if not the same, setup as the GPUs. A base die with the CPU (or GPU or FPGA or whatever accelerator) stacked on top, and then bridge chips to HBM or system memory, would allow memory access without going through a GMI link.
The diagrams I have seen showed that the next-gen GPUs would be two elements, possibly stacks with embedded dies and/or SoIC-stacked dies on top. The two are connected together with very high bandwidth, likely an embedded bridge. They have HBM along one side, and the other side is used to connect to another dual-GPU module. Two such sets can then be connected together to make an 8-GPU-chiplet device. This is why I was pointing at the diagram for the Crusher system here:
ORNL has published the overview of its Crusher system, which is powered by AMD's Optimized 3rd Gen EPYC CPUs & Instinct MI250X GPUs (wccftech.com).
This Crusher system seems very similar to upcoming systems that will mix CPUs and GPUs; it may actually be a test platform to some extent. The diagram shows 200 GB/s links between adjacent GPUs and 50 GB/s links for “remote” GPUs. The adjacent devices may move to a silicon bridge connection. The MI250X appears to have 4 high-speed GPU-GPU links (200 GB/s in aggregate) to the adjacent GPU, 3 GPU-GPU links (50 GB/s each) for other GPUs or the network (3 GPUs, or 2 GPUs + 1 network), and one link (36 GB/s) for the connection to the CPU (off package). That works out to 7x 50 GB/s links plus 1x 36 GB/s link per GPU. That is a fair amount of die area, so moving it to a stacked die seems like a good idea. Also, they would want to use the same chiplets everywhere, so putting all of these links on the compute die itself doesn’t make that much sense. It would also waste die area on the compute die, which generally uses the latest and greatest (and most expensive) process tech. Not all of the interfaces would be needed on all products, but splitting them out and making them on a cheaper node can be a win, since you are then wasting cheap silicon rather than expensive silicon.
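As a quick sanity check on those numbers, here is a small Python tally of the per-GPU off-die link bandwidth. Treating the 200 GB/s adjacent connection as 4x 50 GB/s links is my reading of the diagram, not a confirmed spec.

```python
# Tally of per-GPU (per-GCD) off-die link bandwidth using the Crusher/MI250X
# figures quoted above. Interpreting the adjacent-GPU connection as 4 x 50 GB/s
# links is an assumption based on the diagram, not a confirmed spec.

links = {
    "adjacent GPU (4 x 50 GB/s)":         4 * 50,
    "other GPUs / network (3 x 50 GB/s)": 3 * 50,
    "CPU link (1 x 36 GB/s)":             1 * 36,
}

for name, gbps in links.items():
    print(f"{name:38s} {gbps:4d} GB/s")
print(f"{'total off-die bandwidth per GPU':38s} {sum(links.values()):4d} GB/s")
```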
It gets very difficult to speculate once 2.5D and 3D stacking become common, since there are a lot of possibilities. This is more Zen 5 speculation than Zen 4. Although, we don’t seem to know exactly what Bergamo or Siena will actually be at this point. I was hoping for stacking with Bergamo, but it seems very unlikely to be anything other than a normal Genoa IO die. It is also still hard to tell whether Siena will be salvage die only or a new IO die layout. With 64 cores, they would have to do something more complicated than just half of a Genoa IO die: the 4 quadrants are somewhat independent, but two quadrants would only support 6 chiplets (3 GMI links per quadrant), not the 8 needed for 64 cores with 8-core CCDs. That means a completely new layout with a lot of units removed would be necessary.