Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. For AMD to increase core count they would have to do one of the following:
- Decrease the CCD size.
- Decrease the IOD size.
- Stack low-performance 'small' cores on top of one of the dies (not likely, heat is an issue).
- Use a 'hybrid' design.
- Move to a monolithic die.
- Use a denser process.
- Some combination of the above.
Denser core chiplets seem like the most obvious choice while constrained to the current total package size, including the substrate, but as I said in the past, this raises the critical question of core density versus heat. A larger overall size wouldn't be an issue if there were space on boards today, but given the added junk on boards, even if that could be offset with enterprise trickle-down, we're not going to see anything dramatic.
Stretch goal == best case scenario.
You meant this slide below?
View attachment 78408
Note 1 in the slide above seems to suggest that the next gen SP5 part can attain DDR5-6400 frequencies, which is a decent bump. DT will likely go beyond that.
However, in Zen 4, fclk and mclk are not 1:1, so a higher DDR5 speed does not really mean a higher fabric clock.
Due to the fabric clock limit, currently ~2 GHz at most, there is a threshold beyond which increasing the RAM speed has no impact on memory latency.
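Rough numbers on where that threshold sits, assuming the commonly cited 32 B/fclk-cycle per-CCD read width and a combined 128-bit dual-channel DDR5 bus (neither figure comes from this thread, so treat this as a back-of-the-envelope sketch only):

```python
# Back-of-the-envelope comparison of per-CCD fabric read bandwidth vs DRAM
# bandwidth, to illustrate why a ~2 GHz fclk cap limits how much faster DDR5 helps.
# Assumptions (not from the thread): 32 B per fclk cycle read width per CCD,
# dual-channel DDR5 with a combined 128-bit (16-byte) bus.

def dram_bandwidth_gbs(mt_per_s: float, bus_bytes: int = 16) -> float:
    """Peak DRAM bandwidth in GB/s for a given transfer rate (MT/s)."""
    return mt_per_s * bus_bytes / 1000.0

def ccd_read_bandwidth_gbs(fclk_mhz: float, bytes_per_cycle: int = 32) -> float:
    """Peak IFOP read bandwidth of one CCD in GB/s."""
    return fclk_mhz * bytes_per_cycle / 1000.0

print(dram_bandwidth_gbs(6400))       # ~102.4 GB/s from DDR5-6400
print(ccd_read_bandwidth_gbs(2000))   # ~64.0 GB/s into a single CCD at fclk = 2 GHz
# DDR5-6400 also means mclk = 3200 MHz, so fclk:mclk ends up ~2000:3200, not 1:1.
```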
The IFOP clock has likely been capped at ~2 GHz since Zen 2 due to insertion losses and the high 2 pJ/bit energy usage.
They don't really need an interposer to improve this; the on-package RDL used in the N31 MCDs seems good enough.
It seems AMD got 0.4 pJ/bit on the N31 RDL fanout links, compared to 0.2-0.3 pJ/bit for GUC GLink. I did see 64 Gbps links mentioned on LinkedIn for the new GMI.
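To put those pJ/bit figures in perspective, here is a quick conversion to link power; the 64 GB/s traffic figure is purely illustrative and not a measured GMI number:

```python
# Convert energy-per-bit figures into watts for a given amount of link traffic.
# The 64 GB/s figure below is only an illustrative assumption.

def link_power_watts(pj_per_bit: float, traffic_gbytes_per_s: float) -> float:
    bits_per_second = traffic_gbytes_per_s * 8e9
    return pj_per_bit * 1e-12 * bits_per_second

for label, pj in [("SerDes IFOP", 2.0), ("N31 RDL fanout", 0.4), ("GUC GLink", 0.25)]:
    print(f"{label}: ~{link_power_watts(pj, 64):.2f} W at 64 GB/s of traffic")
# SerDes IFOP: ~1.02 W, N31 RDL fanout: ~0.20 W, GUC GLink: ~0.13 W
# i.e. roughly a 5x difference in interconnect power for the same traffic.
```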
Going forward, if a chiplet-based Strix is really a thing, they are absolutely going to need a new interconnect, considering Dragon Range idles at 10 W (granted, power gating there is not as fine-grained as in the purpose-built mobile APU, and I'm not sure whether the fclk and lane count scale with the workload).
BTW, it is interesting that AMD hides these details (very effectively) on the CPU/Zen 5 side, while sharing info on the MI300 datacenter GPU side...
AMD needs to hype its AI/ML/datacenter GPU business. They need to constantly reassure investors that they are still trying to get into the game for that lucrative AI segment. Hence the MI300 show.
Rome, Milan, and Genoa have been doing fine, thus there is no need to spend time hyping Turin.
Add to this that the main hypeable USP of Epyc chips (core count) has been plenty obvious so far.
Intel can actually hang with AMD in MT workloads now, and while AMD certainly has an efficiency edge, there are plenty of people who don't care about such things.
The kind of people who would want a 24-32 core desktop CPU are those who are probably running rendering software that will gladly scale beyond 16 cores. Core to core latency doesn't matter much in those cases.
There are plenty of workloads that wouldn't care if AMD's solution to offering a 24-core CPU was just adding another chiplet, any more than they would care about AMD putting 12 cores on a CCD.
Just don't expect it to be cheap, assuming it does exist. AMD would likely try to maintain per-core pricing similar to the previous generation and use this as an excuse to sell a $1,000 CPU. But it's less expensive than Threadripper, so some people will buy it.
BTW, are the chiplets in Ryzen connected only to the IO chip, or between themselves as well? So if cores on one chip need to communicate with cores on the other one, does this need to happen via the IO chip or not?
I'm pretty sure there is no direct routing between chiplets. It has to go through the IO chiplet itself.
It goes to the IO die, but the IO die is able to tell whether another CCD has the needed data, in which case it retrieves it from that CCD instead of from RAM.
Could be wrong tho, but that's how I remember it.
Is the bolded the reason why this is the case? Would it not otherwise be better to have the two chips able to communicate with each other directly, without the need to go via the IO chip? And if the other chip does not have the data, then retrieve it from RAM (obviously the IO chip would have to be involved in that).
Anyway, I recall that in the Zen 1 days, in the first Threadrippers, when there was no IO chip, the CCDs used to be connected to each other, right?
Don't know about that, but IIRC even for Zen 2, when the CCXs were literally right next to each other on the same CCD, they still had to go through the IO die to communicate with each other's L3.
They don't get to increase IFOP performance, or any sort of chip-to-chip communications, to a significant degree until they start to stack chips. I envision a possible world where AMD has an N6 IOD with two CCDs stacked on top of it, one with 8 HP cores and one with 16 HD cores. They can easily have 2-4x the bandwidth between the CCXs and the IOD by doing that. The heat spreader being so thick on the current AM5 socket allows them to increase the chip Z-height on the MCP chips by thinning the heat spreader. Such a setup would easily best the 13900K on MT tasks and may even get a good boost in ST tasks, as the latency for memory access should be notably lower.
I wonder if AMD can use the same tech they used for RDNA 3, i.e. that high-performance fanout, instead of going all the way to the stacking route. The only issue I see with the HP fanout approach is that the dies likely need to be super close to each other, and I'm not sure that is viable for server products where the CCDs can be quite far from the IOD.
Don't get me wrong, there ARE ways of improving chip to chip communications without stacking, but they are costly and likely more prone to yield issues.
The move to MCDs from RDNA 2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, keeping the bandwidth-limited SerDes IFOPs instead may still be preferable in MCM CPUs.
The MCDs placed the cache in another chiplet when the previous Infinity Cache was on-die and close. I don't think the fanout is itself inherently high latency, and anyhow, the chiplet CPUs already have the IOD some distance away.
But isn't the latency hit from the transition to an MCM approach due to the fact that a monolithic design will always have better latency? Conversely, the current chiplet approach for their CPUs already took this latency hit, so I doubt using shorter, denser wires via the HP fanout would add even more latency than the current SerDes IFOPs, but I could be mistaken.
Both true, points taken. My point was that the fanout design as part of the MCDs was focused on bandwidth, whereas CPUs don't have anywhere near the same bandwidth needs, but latency is more important.
Edit: Haha, just saw @maddie's post. Yes, that point precisely.
Just look at how close the dies are placed to each other on MI300 and N31.
Firstly, this involves a big production cost increase because of more expensive packaging, as well as more die area for the CCDs due to the additional ports.
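One way to see the port/area argument is to count links: a star topology through the IOD needs one IFOP port per CCD, while direct CCD-to-CCD links grow quadratically. A small sketch; the CCD counts are just examples, not a statement about specific SKUs:

```python
# Extra links/ports needed if every CCD also talked directly to every other CCD,
# on top of the existing one-IFOP-per-CCD star through the IOD.
# CCD counts are illustrative only.

def star_links(n_ccds: int) -> int:
    return n_ccds                        # one CCD<->IOD link per chiplet

def extra_direct_links(n_ccds: int) -> int:
    return n_ccds * (n_ccds - 1) // 2    # one additional link per CCD pair

for n in (2, 8, 12):
    print(f"{n} CCDs: {star_links(n)} star links + {extra_direct_links(n)} direct links, "
          f"{n - 1} extra port(s) per CCD")
# 2 CCDs: +1 direct link; 8 CCDs: +28; 12 CCDs: +66, with 11 extra ports per CCD
```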
They don't get to increase IFOP performance, or any sort of chip to chip communications, to a significant degree until they start to stack chips.
What?
If I had a farm, I would bet it on Infinity Fan-out Links, i.e. InFO-RDL, having at least the same, if not significantly lower, latency than IFOP. If you meant that the latency is worse than with an IMC, then yes, the monolithic design would be in front.
Zen 2-4 fabric topology is much more complex, unlike the RDNA 3 "star", especially taking multi-CPU configs into account. Even the client-side CPU uses 4 layers just for IFOP routing. Add to this power RDLs, interface route layers, etc.
The RDL fanout connection is parallel, unlike SerDes which is serial. So the latency should be lower with a parallel link.
RDL fanout links are just signal links; you can still serialize the data through them if you want. And usually there is a higher-level encoding and error detection/correction scheme on top as well.
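To make the parallel-vs-serial latency point concrete, here is a hedged sketch of the time to push one cache line across a narrow serialized link versus a wide fan-out bus; every number in it (lane counts, per-wire rates, PHY overheads) is an illustrative assumption, not a measured IFOP or InFO figure:

```python
# Time to move a 64-byte cache line across a narrow serialized link versus a wide
# parallel fan-out bus. All figures below are illustrative assumptions only.

def transfer_ns(payload_bytes: float, wires: int, gbit_per_wire: float,
                fixed_phy_ns: float) -> float:
    serialize_ns = payload_bytes * 8 / (wires * gbit_per_wire)  # Gbit/s == bits/ns
    return serialize_ns + fixed_phy_ns

cache_line = 64
# Narrow SerDes-style link: few wires, very high per-wire rate, more PHY latency.
print(transfer_ns(cache_line, wires=16, gbit_per_wire=32, fixed_phy_ns=5.0))   # ~6.0 ns
# Wide parallel fan-out bus: many slower wires, little extra PHY latency.
print(transfer_ns(cache_line, wires=512, gbit_per_wire=2, fixed_phy_ns=1.0))   # ~1.5 ns
# With these assumed numbers the gap comes mostly from the SerDes PHY overhead,
# not the raw serialization time, which is the point being argued above.
```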