Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. For AMD to increase core count they would have to do one of the following:
  1. decrease the CCD size.
  2. decrease the IOD size.
  3. stack low performance 'small' cores on top of one of the dies. (not likely, heat is an issue)
  4. Use a 'hybrid' design.
  5. move to a monolithic die.
  6. Use a dense process
  7. Some combination of above.

Having just held the AM5 CPU in my hand, it is surprisingly tiny. The area left under the heatsink is quite small after you subtract the cutouts.

Looking at the picture, and assuming the die sizes stay approximately the same, it will take reorganizing the MCD to get an extra CCD in.

 

A///

Diamond Member
Feb 24, 2017
4,352
3,155
136
Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. For AMD to increase core count they would have to do one of the following:
  1. decrease the CCD size.
  2. decrease the IOD size.
  3. stack low performance 'small' cores on top of one of the dies. (not likely, heat is an issue)
  4. Use a 'hybrid' design.
  5. move to a monolithic die.
  6. Use a dense process
  7. Some combination of above.
Denser core chiplets seem like the most obvious choice while constrained to the current total package size, substrate base included, but as I said in the past this raises the critical question of core density versus heat. A larger overall package wouldn't be an issue if there were space on boards today, but given all the added junk on boards, even with enterprise trickle-down to offset it, we're not going to see anything dramatic.
 

Kocicak

Senior member
Jan 17, 2019
982
973
136
Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. ...

One very practical solution would be an IO die with a few cores integrated into it (4-8), which could work as a standalone CPU and serve the lowest part of the market, with two additional chiplets connectable to it.

You could build a lot of very interesting PC CPUs from these parts:
  • IO die with weak graphics and up to 8 cores on it
  • chiplet with graphics, or a graphics extension to what is in the IO die
  • chiplet with performance cores
  • chiplet with larger number of compact cores
  • some other specialized accelerator chiplets, which would be usable in small workstations
 
Last edited:

uzzi38

Platinum Member
Oct 16, 2019
2,698
6,393
146
You meant this slide below?
View attachment 78408

Note 1 in the slide above seems to suggest that the next gen SP5 part can attain DDR5-6400 frequencies, which is a decent bump. DT will likely go beyond that.
However in Zen 4, fclk and mclk are not 1:1. So higher DDR5 speed does not really mean a higher fabric clock.
Due to the fabric clock limit, at max ~2GHz currently, there is a threshold beyond which increasing the RAM speed has no impact on the memory latency.
The IFOP clock has likely been capped at ~2 GHz since Zen 2 due to insertion losses and the high ~2 pJ/bit energy cost.
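As a rough back-of-envelope illustration of that point (a sketch only; the 32 B/cycle read and 16 B/cycle write per IFOP link per fclk are the commonly cited figures, not numbers AMD has published for every generation):

```python
# Back-of-envelope: why a ~2 GHz fclk cap matters even as DDR5 speeds climb.
# Assumed link widths (32 B read / 16 B write per fclk cycle) are the commonly
# cited figures for Zen's IFOP, not official AMD specifications.

mts = 6400                       # DDR5-6400, mega-transfers per second
channels = 2                     # dual-channel client platform
bytes_per_transfer = 8           # 64-bit channel

dram_bw = mts * 1e6 * channels * bytes_per_transfer / 1e9
print(f"DDR5-{mts} dual-channel: {dram_bw:.1f} GB/s")        # ~102.4 GB/s

fclk_hz = 2.0e9                  # ~2 GHz fabric clock cap mentioned above
ifop_read_bw = fclk_hz * 32 / 1e9    # assumed 32 B per fclk cycle (read)
ifop_write_bw = fclk_hz * 16 / 1e9   # assumed 16 B per fclk cycle (write)
print(f"IFOP per CCD @ 2 GHz fclk: {ifop_read_bw:.0f} GB/s read / "
      f"{ifop_write_bw:.0f} GB/s write")

# mclk for DDR5-6400 is 3200 MHz, so fclk:mclk sits around 2000:3200 rather
# than 1:1 -- the memory side can outrun what one CCD pulls over its link.
```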

They don't really need an interposer to improve this; the on-package RDL fanout used for the N31 MCDs seems good enough.
It seems AMD got 0.4 pJ/bit on the N31 RDL fanout links, compared to 0.2-0.3 pJ/bit for GUC GLink. I did see 64 Gbps links mentioned on LinkedIn for the new GMI.
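Taking those energy-per-bit figures at face value (0.4 pJ/bit for the N31 fanout links, roughly 2 pJ/bit for IFOP as mentioned above, 0.2-0.3 pJ/bit for GLink; treat them as reported rather than verified), the link-power difference at a fixed bandwidth is simple arithmetic:

```python
# Hypothetical link-power comparison using the pJ/bit figures quoted above.

def link_power_w(gbit_per_s, pj_per_bit):
    """Watts needed to move gbit_per_s gigabits per second at pj_per_bit."""
    return gbit_per_s * 1e9 * pj_per_bit * 1e-12

traffic = 64 * 8   # e.g. 64 GB/s of read traffic, expressed in Gbit/s

for name, pj in [("IFOP SerDes    (~2 pJ/bit)", 2.0),
                 ("N31 RDL fanout (0.4 pJ/bit)", 0.4),
                 ("GUC GLink      (~0.25 pJ/bit)", 0.25)]:  # midpoint of 0.2-0.3
    print(f"{name}: {link_power_w(traffic, pj):.2f} W for {traffic} Gbit/s")

# ~1.02 W vs ~0.20 W vs ~0.13 W -- a meaningful chunk of a mobile part's
# idle budget, which is the Dragon Range concern raised below.
```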

Going forward, if Strix being chiplet-based is really a thing, they are absolutely going to need a new interconnect, considering Dragon Range idles at 10 W (granted, power gating is not as fine-grained as in the purpose-built mobile APU, and I'm not sure whether fclk and lane count scale with the workload).
Stretch goal == best case scenario.

Definitely not a given. If anything AMD ran into more issues than expected with Genoa, so even these may have been ambitious. I would definitely wait and see what final advertised memory speeds are.
 
Reactions: Tlh97 and Exist50

yuri69

Senior member
Jul 16, 2013
433
714
136
BTW, it is interesting that AMD hides these details (very effectively) on the CPU/Zen 5 side, while sharing info on the MI300 datacenter GPU side...
AMD needs to hype its AI/ML/datacenter GPU business. They need to constantly reassure investors that they are still trying to get into the game for that lucrative AI segment. Hence the MI300 show.

That market situation is completely different from the datacenter CPU market and roadmap. Rome, Milan, and Genoa have been doing fine. Thus there is no need to spend time hyping Turin.
 

Timmah!

Golden Member
Jul 24, 2010
1,453
709
136
Intel can actually hang with AMD in MT workloads now, and while they certainly have an efficiency edge, there are plenty of people who don't care about such things.

The kind of people who would want a 24-32 core desktop CPU are those who are probably running rendering software that will gladly scale beyond 16 cores. Core to core latency doesn't matter much in those cases.

There are plenty of workloads that wouldn't care if AMD's solution to offering a 24-core CPU was just adding another chiplet, any more than they would care about AMD putting 12 cores on a CCD.

Just don't expect it to be cheap, assuming it does exist. AMD would likely try to maintain per-core pricing similar to the previous generation and use this as an excuse to sell a $1,000 CPU. But it's less expensive than Threadripper, so some people will buy it.

24 cores with 2x 12-core chiplets would indeed be great. And IMO more likely to happen than 3x 8-core chiplets, as those would be a pain to fit under that IHS. If they intend to bring 24 cores to the AM5 socket at some point, with Zen 5 or 6 or whatever, I think this will be the way. If they are not going with 12-core CCDs, then I don't see 24 cores happening until the next socket.

Improving the inter-chip connectivity from that crappy 2 GHz IF would be nice as well.

Anyway, it's a bit disappointing that they increase server core counts like there is no tomorrow with each passing generation, nearing 200 in the near future, yet stay at 16 on desktop and seemingly intend to do so for the foreseeable future. I almost hope for an RPL refresh to come with 32 of those small cores to force AMD's hand in this matter.

BTW, are the chiplets in Ryzen connected only to the IO die, or between themselves as well? So if cores on one chiplet need to communicate with cores on the other one, does this have to happen via the IO die, or not?
 
Reactions: Tlh97 and Geddagod

Geddagod

Golden Member
Dec 28, 2021
1,201
1,164
106
BTW, are the chiplets in Ryzen connected only to the IO die, or between themselves as well? So if cores on one chiplet need to communicate with cores on the other one, does this have to happen via the IO die, or not?
I'm pretty sure there is no direct routing between chiplets. It has to go through the IO die itself.
The request goes to the IO die, but the IO die knows whether another CCD has the data needed, in which case it retrieves it from the other CCD rather than from RAM.
Could be wrong tho, but that's how I remember it.
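Purely as an illustrative sketch of that kind of flow (a toy directory lookup living on the IO die; the class and method names here are made up and this is not AMD's actual coherence protocol):

```python
# Toy model of an IO-die-resident directory deciding where a cache-line
# request is serviced from. Illustrative only; not AMD's real protocol.

class IODieDirectory:
    def __init__(self):
        self.owner = {}   # address -> CCD id believed to hold the line

    def read(self, requester_ccd, addr):
        owner = self.owner.get(addr)
        if owner is not None and owner != requester_ccd:
            # Another CCD holds the line: the IO die forwards the request
            # to it -- traffic still flows CCD -> IOD -> CCD, never directly.
            return f"CCD{owner} -> IOD -> CCD{requester_ccd}"
        # No other CCD has it: fetch from DRAM via the memory controller.
        self.owner[addr] = requester_ccd
        return f"DRAM -> IOD -> CCD{requester_ccd}"

d = IODieDirectory()
print(d.read(0, 0x1000))   # DRAM -> IOD -> CCD0
print(d.read(1, 0x1000))   # CCD0 -> IOD -> CCD1
```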
 
Reactions: Tlh97

Timmah!

Golden Member
Jul 24, 2010
1,453
709
136
I'm pretty sure there is no direct routing between chiplets. It has to go through the IO die itself.
The request goes to the IO die, but the IO die knows whether another CCD has the data needed, in which case it retrieves it from the other CCD rather than from RAM.
Could be wrong tho, but that's how I remember it.

Is the bolded part the reason why this is the case? Would it not otherwise be better to have the 2 chips able to communicate with each other directly, without the need to go via the IO chip? And if the other chip does not have the data, then retrieve it from RAM (obviously the IO chip would now have to be involved in that).
Anyway, I recall that in the Zen 1 days, in the first Threadrippers, when there was no IO chip, the CCDs used to be connected to each other, right?
 

Geddagod

Golden Member
Dec 28, 2021
1,201
1,164
106
Is the bolded part the reason why this is the case? Would it not otherwise be better to have the 2 chips able to communicate with each other directly, without the need to go via the IO chip? And if the other chip does not have the data, then retrieve it from RAM (obviously the IO chip would now have to be involved in that).
Anyway, I recall that in the Zen 1 days, in the first Threadrippers, when there was no IO chip, the CCDs used to be connected to each other, right?
Don't know about that, but IIRC even for Zen 2, when the CCXs were literally right next to each other on the same CCD, they still had to go through the IO die to communicate with each other's L3.
AFAIK if you want the 2 CCDs to be able to communicate directly, you need an IMC or something of that sort on each of the compute dies, like Intel does, since the L3s of the two CCDs are separate.
However, that also removes the cost-saving advantage of not having IO on the CCD.
 

LightningZ71

Golden Member
Mar 10, 2017
1,655
1,939
136
They don't get to increase IFOP performance, or any sort of chip to chip communications, to a significant degree until they start to stack chips. I envision a possible world where AMD has an N6 IOD with two CCDs stacked on top of it, one with 8 HP cores and one with 16 HD cores. They can easily have 2-4X the amount of bandwidth between the CCXs and the IOD by doing that. Having the heat spreader so thick on the current AM5 socket allows them to increase the chip Z-height on the MCP chips by thinning the heat spreader. Such a setup would easily best the 13900K on MT tasks and may even get a good boost in ST tasks as the latency for memory access should be notably lower.

Don't get me wrong, there ARE ways of improving chip to chip communications without stacking, but they are costly and likely more prone to yield issues.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
Anyway, I recall that in the Zen 1 days, in the first Threadrippers, when there was no IO chip, the CCDs used to be connected to each other, right?

Don't think they were called CCDs, but yup. You could even have chiplets with their memory controller disabled.

for Zen 2, when the CCXs were literally right next to each other on the same CCD, they still had to go through the IO die to communicate with each other's L3.

Yup.
 

Saylick

Diamond Member
Sep 10, 2012
3,372
7,103
136
They don't get to increase IFOP performance, or any sort of chip to chip communications, to a significant degree until they start to stack chips. I envision a possible world where AMD has an N6 IOD with two CCDs stacked on top of it, one with 8 HP cores and one with 16 HD cores. They can easily have 2-4X the amount of bandwidth between the CCXs and the IOD by doing that. Having the heat spreader so thick on the current AM5 socket allows them to increase the chip Z-height on the MCP chips by thinning the heat spreader. Such a setup would easily best the 13900K on MT tasks and may even get a good boost in ST tasks as the latency for memory access should be notably lower.

Don't get me wrong, there ARE ways of improving chip to chip communications without stacking, but they are costly and likely more prone to yield issues.
I wonder if AMD can use the same tech they used for RDNA 3, i.e. that high performance fanout, instead of going all the way to the stacking route. Only issue I see with the HP fanout approach is that the dies likely need to be super close to each other, and I'm not sure that is viable for server products where the CCDs can be quite far from the IOD.
 
Reactions: Gideon

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
I wonder if AMD can use the same tech they used for RDNA 3, i.e. that high performance fanout, instead of going all the way to the stacking route. Only issue I see with the HP fanout approach is that the dies likely need to be super close to each other, and I'm not sure that is viable for server products where the CCDs can be quite far from the IOD.
The move to MCDs from RDNA2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, continuing to use bandwidth-limited SerDes IFOPs may still be preferable in MCM CPUs.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
The move to MCDs from RDNA2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, continuing to use bandwidth-limited SerDes IFOPs may still be preferable in MCM CPUs.
The MCDs placed the cache on another chiplet, whereas the previous Infinity Cache was on-die and close. I don't think the fanout itself is inherently high latency, and anyhow, the chiplet CPUs already have the IOD some distance away.
 

Saylick

Diamond Member
Sep 10, 2012
3,372
7,103
136
The move to MCDs from RDNA2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, continuing to use bandwidth-limited SerDes IFOPs may still be preferable in MCM CPUs.
But isn't the latency hit from the transition to an MCM approach due to the fact that a monolithic design will always have better latency? Conversely, the current chiplet approach for their CPUs already took this latency hit so I doubt using shorter, denser wires via the HP fanout will add even more latency than the current SerDES IFOPs, but I could be mistaken.

Edit: Haha, just saw @maddie's post. Yes, that point precisely.
 
Reactions: Tlh97 and Thibsie

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
The MCDs placed the cache on another chiplet, whereas the previous Infinity Cache was on-die and close. I don't think the fanout itself is inherently high latency, and anyhow, the chiplet CPUs already have the IOD some distance away.
But isn't the latency hit from the transition to an MCM approach due to the fact that a monolithic design will always have better latency? Conversely, the current chiplet approach for their CPUs already took this latency hit so I doubt using shorter, denser wires via the HP fanout will add even more latency than the current SerDES IFOPs, but I could be mistaken.

Edit: Haha, just saw @maddie's post. Yes, that point precisely.
Both true, points taken. My point was the fanout design as part of the MCDs was focused on bandwidth whereas CPUs by far don't have the same bandwidth needs, but latency is more important.

May well be the case that less bandwidth allows the fanout die to be smaller and cheaper and be a feasible replacement for SerDes and that the lack of need for SerDes as well as some substrate distance actually saves latency. (Talking of IFOPs SerDes, wonder how much of their die area would still be needed just for fanout I/O.)
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. For AMD to increase core count they would have to do one of the following:
  1. decrease the CCD size.
  2. decrease the IOD size.
  3. stack low performance 'small' cores on top of one of the dies. (not likely, heat is an issue)
  4. Use a 'hybrid' design.
  5. move to a monolithic die.
  6. Use a dense process
  7. Some combination of above.
Just look at how close the dies are placed to each other on MI300 and N31.
So even if you expect the same area to be lost to caps and stuff, it is not impossible to place 3 CCDs and 1 IOD of similar size in the same area.
Of course you would need to make trade-offs regarding the geometries. Rather lengthy CCDs as well as a very long and narrow IOD to form an E might do the trick. On the common borders you would need enough beachfront in order to make the InFO-R connections and ports. This does not need to be very wide, but more on this later...
So, all in all, I fail to see why this should not be absolutely possible with another form of physical Interconnect implementation.

But a hybrid design is definitely on the cards as well. It just remains to be seen what this could look like aside from the monolithic PHX2.

Is the bolded part the reason why this is the case? Would it not otherwise be better to have the 2 chips able to communicate with each other directly, without the need to go via the IO chip?
Firstly, this would drive up production cost quite a bit because of the more expensive packaging as well as the additional die area the extra ports would take on the CCDs.
Furthermore, there are heavily diminishing returns. With only 64/32 GByte/s access to another CCD's L3 and horrible latencies, the result would be negligible.
And lastly, there are only oh-so-few common workloads with this much inter-CCD traffic. One simply has to ask why Intel is going through all this pain with SPR. This, by the way, was also mentioned by C'n'C in their SPR article.

They don't get to increase IFOP performance, or any sort of chip to chip communications, to a significant degree until they start to stack chips.
What?
With InFO-R they can massively improve bandwidth, as you get quite a lot of it per mm of beachfront. But they don't even have to. As stated before, inter-CCX traffic simply is not worth the pain. They would just need to increase to, let's say, 256/128 GByte/s in order to get the best out of whatever dual-channel RAM might be available in the next couple of years, even with only one CCD.
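A quick sanity check of those numbers (a sketch, assuming 2 memory channels at 8 bytes per transfer; the 64/32 and 256/128 GB/s link figures are the ones quoted in this post):

```python
# Rough check of the 256/128 GB/s suggestion: what dual-channel DDR5 speed
# would a given per-CCD read bandwidth keep up with? Assumes 2 channels of
# 8 bytes per transfer; link figures are the ones quoted above.

def matching_ddr5_mts(link_read_gbs, channels=2, bytes_per_transfer=8):
    """DDR5 MT/s that would saturate a read link of link_read_gbs GB/s."""
    return link_read_gbs * 1e9 / (channels * bytes_per_transfer) / 1e6

for link_gbs in (64, 256):
    print(f"{link_gbs:>3} GB/s read ~ dual-channel DDR5-{matching_ddr5_mts(link_gbs):.0f}")

# 64 GB/s  ~ DDR5-4000: today's link is already the bottleneck for one CCD.
# 256 GB/s ~ DDR5-16000: plenty of headroom for the next few years of DRAM.
```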
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
The move to MCDs from RDNA2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, continuing to use bandwidth-limited SerDes IFOPs may still be preferable in MCM CPUs.
If I had a farm, I would bet it on Infinity Fan-out Links, i.e. InFO-RDL, having at least the same, if not significantly lower, latencies than IFOP. If you meant that latency is worse than with an IMC, then yes, the monolithic design would be in front.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
BTW, are the chiplets in Ryzen connected only to the IO die, or between themselves as well? So if cores on one chiplet need to communicate with cores on the other one, does this have to happen via the IO die, or not?

Even though each CCD has 2 GMI links, they are not used in 2-CCD CPUs (like the 7950X) for direct communication between CCDs. The second link is basically not connected, as far as I know.

This would be more of a special case, and I doubt AMD even accounts for the possibility in the algorithm.

Also, the IO die has 2 links, and a CCD has 2 links. So in theory, single-CCD CPUs like the 7700X could use both of them, but again, as far as we know, the second link is unused.
 
Last edited:

Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
I wonder if AMD can use the same tech they used for RDNA 3, i.e. that high performance fanout, instead of going all the way to the stacking route. Only issue I see with the HP fanout approach is that the dies likely need to be super close to each other, and I'm not sure that is viable for server products where the CCDs can be quite far from the IOD.

I think it is a question of timing. IMO, eventually, on client side, AMD will move to stacking CCDs on the IO die, which will also have SRAM cache, but this is probably not going to be ready for Zen 5. But fanout RDL is ready and can be used.

So maybe fanout RDL for Zen 5, stacking for Zen 6.

Or it could happen in parallel. If AMD has stacked Zen 5 cores for MI400, maybe they can be brought to select client CPUs, while the majority uses RDL for Zen 5.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,323
2,929
106
The move to MCDs from RDNA2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD did for GPUs, but if the inherent latency of the tech even in the best case is significant keeping using bandwidth limited SerDes IFOPs instead may still be preferable in MCM CPUs.

The RDL fanout connection is parallel, unlike SerDes which is serial. So the latency should be lower with a parallel link.
 

PJVol

Senior member
May 25, 2020
616
547
136
I wonder if AMD can use the same tech they used for RDNA 3
The Zen 2-4 fabric topology is much more complex than the RDNA3 "star", especially once you take multi-CPU configs into account. Even the client-side CPU uses 4 layers just for IFOP routing. Add to this the power RDLs, interface routing layers, etc.
 
Last edited:
Reactions: lightmanek

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
The RDL fanout connection is parallel, unlike SerDes which is serial. So the latency should be lower with a parallel link.
RDL fanout links are just signal links; you can still serialize the data over them if you want. And usually there is a higher-level encoding and error detection/correction scheme on top as well.
At the front of the link lies the PHY/line driver, which is where the bulk of the power is burnt, owing to all the insertion losses from the impedance of the traces.
The serialization (usually with compression) happens on-die and is relatively cheap compared to the transmission in terms of energy consumption.

Because the excitation of the signal pulse over the line is delayed by the line's impedance, there is some circuitry like T-coils to counteract the capacitive behavior of the line. This puts a limit on the frequency that can be achieved over the traces. But T-coils occupy space, and adding more Tx/Rx lines is expensive in terms of die area (and power).

The advantage of RDL is therefore trace density, which means you can have more lines and less attenuation (with the help of some dielectric magic between the layers). But the dies need to be closer to each other, otherwise some kind of line driver will again be needed. And this makes LSI the next logical step after RDL, with even more trace density and active switching to repeat the signal at native operating voltages over a bigger distance, eliminating the PHY entirely.

So this is a long way of saying: yes, you can do fully parallel access over the RDL links given the trace density and frequency improvements, but you would likely still have some serialization or signal muxing/demuxing in between (and depending on how far apart the dies are, you might even need a low-power line driver).
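A toy version of the trade-off being described here (all of the lane counts, per-lane rates, and SerDes overheads below are invented for illustration; they are not GMI or InFO-R specifications):

```python
# Toy comparison: time to move one 64-byte cache line over a wide, parallel,
# RDL-style link vs. a narrow serialized one. All numbers are invented.

def transfer_ns(payload_bits, lanes, gbit_per_lane, fixed_overhead_ns):
    """Wire time for the payload plus a fixed SerDes/PHY/encoding overhead."""
    return payload_bits / (lanes * gbit_per_lane) + fixed_overhead_ns

line_bits = 64 * 8   # one cache line

# Wide and relatively slow per lane: many cheap RDL traces, light PHY.
wide = transfer_ns(line_bits, lanes=128, gbit_per_lane=4, fixed_overhead_ns=1.0)

# Narrow and fast per lane: few lanes, heavy SerDes adding fixed latency.
narrow = transfer_ns(line_bits, lanes=16, gbit_per_lane=32, fixed_overhead_ns=5.0)

print(f"wide/parallel link : {wide:.1f} ns per 64 B line")
print(f"narrow/SerDes link : {narrow:.1f} ns per 64 B line")

# Both links have the same aggregate rate (512 Gbit/s), so the raw wire time
# is identical; the gap comes from the fixed serialization/PHY overhead that
# each hop adds -- which is the latency point made above.
```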
 