Not likely, there simply isn't enough space. Some folks have attempted a mockup, but mockups ignore the electrical connections, and more importantly, the distance between those connections. For AMD to increase core count they would have to do one of the following:
- Decrease the CCD size.
- Decrease the IOD size.
- Stack low-performance 'small' cores on top of one of the dies (not likely, heat is an issue).
- Use a 'hybrid' design.
- Move to a monolithic die.
- Use a denser process.
- Some combination of the above.
Denser core chiplets seem like the most obvious choice while constrained to the current total package size, including the substrate, but as I said in the past, this raises the critical question of core density versus heat. A larger overall size wouldn't be an issue if there were space on boards today, but given the added junk on boards, even if that could be offset with enterprise trickle-down, we're not going to see anything dramatic.
Stretch goal == best case scenario.
You meant this slide below?
View attachment 78408
Note 1 in the slide above seems to suggest that the next gen SP5 part can attain DDR5-6400 frequencies, which is a decent bump. DT will likely go beyond that.
However, in Zen 4, fclk and mclk are not 1:1, so a higher DDR5 speed does not really mean a higher fabric clock.
Due to the fabric clock limit, currently ~2 GHz at most, there is a threshold beyond which increasing the RAM speed has no impact on memory latency.
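Rough numbers on where that threshold sits, assuming the commonly cited 32 B/fclk-cycle per-CCD read width and a combined 128-bit dual-channel DDR5 bus (neither figure comes from this thread, so treat this as a back-of-the-envelope sketch only):

```python
# Back-of-the-envelope comparison of per-CCD fabric read bandwidth vs DRAM
# bandwidth, to illustrate why a ~2 GHz fclk cap limits how much faster DDR5 helps.
# Assumptions (not from the thread): 32 B per fclk cycle read width per CCD,
# dual-channel DDR5 with a combined 128-bit (16-byte) bus.

def dram_bandwidth_gbs(mt_per_s: float, bus_bytes: int = 16) -> float:
    """Peak DRAM bandwidth in GB/s for a given transfer rate (MT/s)."""
    return mt_per_s * bus_bytes / 1000.0

def ccd_read_bandwidth_gbs(fclk_mhz: float, bytes_per_cycle: int = 32) -> float:
    """Peak IFOP read bandwidth of one CCD in GB/s."""
    return fclk_mhz * bytes_per_cycle / 1000.0

print(dram_bandwidth_gbs(6400))       # ~102.4 GB/s from DDR5-6400
print(ccd_read_bandwidth_gbs(2000))   # ~64.0 GB/s into a single CCD at fclk = 2 GHz
# DDR5-6400 also means mclk = 3200 MHz, so fclk:mclk ends up ~2000:3200, not 1:1.
```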
The IFOP clock has likely been capped at ~2 GHz since Zen 2 due to insertion losses and the high 2 pJ/bit energy usage.
They don't really need an interposer to improve this; the on-package RDL used in the N31 MCDs seems good enough.
It seems AMD got 0.4 pJ/bit on the N31 RDL fanout links, compared to 0.2-0.3 pJ/bit for GUC GLink. I did see 64 Gbps links mentioned on LinkedIn for the new GMI.
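To put those pJ/bit figures in perspective, here is a quick conversion to link power; the 64 GB/s traffic figure is purely illustrative and not a measured GMI number:

```python
# Convert energy-per-bit figures into watts for a given amount of link traffic.
# The 64 GB/s figure below is only an illustrative assumption.

def link_power_watts(pj_per_bit: float, traffic_gbytes_per_s: float) -> float:
    bits_per_second = traffic_gbytes_per_s * 8e9
    return pj_per_bit * 1e-12 * bits_per_second

for label, pj in [("SerDes IFOP", 2.0), ("N31 RDL fanout", 0.4), ("GUC GLink", 0.25)]:
    print(f"{label}: ~{link_power_watts(pj, 64):.2f} W at 64 GB/s of traffic")
# SerDes IFOP: ~1.02 W, N31 RDL fanout: ~0.20 W, GUC GLink: ~0.13 W
# i.e. roughly a 5x difference in interconnect power for the same traffic.
```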
Going forward, if a chiplet-based Strix is really a thing, they are absolutely going to need a new interconnect, considering Dragon Range idles at 10 W (granted, power gating there is not as fine-grained as in the purpose-built mobile APU, and I'm not sure whether the fclk and lane count scale with the workload).
BTW, it is interesting that AMD hides these details (very effectively) on the CPU/Zen 5 side, while sharing info on the MI300 datacenter GPU side...
AMD needs to hype its AI/ML/datacenter GPU business. They need to constantly reassure investors that they are still trying to get into the game for that lucrative AI segment. Hence the MI300 show.
Rome, Milan, and Genoa have been doing fine, thus there is no need to spend time hyping Turin.
Add to this that the main hypeable USP of Epyc chips (core count) has been plenty obvious so far.
Intel can actually hang with AMD in MT workloads now, and while AMD certainly has an efficiency edge, there are plenty of people who don't care about such things.
The kind of people who would want a 24-32 core desktop CPU are those who are probably running rendering software that will gladly scale beyond 16 cores. Core to core latency doesn't matter much in those cases.
There are plenty of workloads that wouldn't care if AMD's solution to offering a 24-core CPU was just adding another chiplet, any more than they would care about AMD putting 12 cores on a CCD.
Just don't expect it to be cheap, assuming it does exist. AMD would likely try to maintain per-core pricing similar to the previous generation and use this as an excuse to sell a $1,000 CPU. But it's less expensive than Threadripper, so some people will buy it.
BTW, are the chiplets in Ryzen connected only to the IO chip, or between themselves as well? So if cores on one chip need to communicate with cores on the other one, does this need to happen via the IO chip or not?
I'm pretty sure there is no direct routing between chiplets. It has to go through the IO chiplet itself.
It goes to the IO die, but the IO die is able to tell whether another CCD has the needed data, in which case it retrieves it from that CCD instead of from RAM.
Could be wrong tho, but that's how I remember it.
Is the bolded the reason why this is the case? Would it not otherwise be better to have the two chips able to communicate with each other directly, without the need to go via the IO chip? And if the other chip does not have the data, then retrieve it from RAM (obviously the IO chip would have to be involved in that).
Anyway, I recall that in the Zen 1 days, in the first Threadrippers, when there was no IO chip, the CCDs used to be connected to each other, right?
Don't know about that, but IIRC even for Zen 2, when the CCXs were literally right next to each other on the same CCD, they still had to go through the IO die to communicate with each other's L3.
They don't get to increase IFOP performance, or any sort of chip-to-chip communications, to a significant degree until they start to stack chips. I envision a possible world where AMD has an N6 IOD with two CCDs stacked on top of it, one with 8 HP cores and one with 16 HD cores. They can easily have 2-4x the bandwidth between the CCXs and the IOD by doing that. The heat spreader being so thick on the current AM5 socket allows them to increase the chip Z-height on the MCP chips by thinning the heat spreader. Such a setup would easily best the 13900K on MT tasks and may even get a good boost in ST tasks, as the latency for memory access should be notably lower.
I wonder if AMD can use the same tech they used for RDNA 3, i.e. that high-performance fanout, instead of going all the way to the stacking route. The only issue I see with the HP fanout approach is that the dies likely need to be super close to each other, and I'm not sure that is viable for server products where the CCDs can be quite far from the IOD.
Don't get me wrong, there ARE ways of improving chip to chip communications without stacking, but they are costly and likely more prone to yield issues.
The move to MCDs from RDNA 2 to 3 increased bandwidth by a lot, but also worsened latency. This may be a tradeoff AMD accepted for GPUs, but if the inherent latency of the tech is significant even in the best case, keeping the bandwidth-limited SerDes IFOPs instead may still be preferable in MCM CPUs.
The MCDs placed the cache in another chiplet when the previous Infinity Cache was on-die and close. I don't think the fanout is itself inherently high latency, and anyhow, the chiplet CPUs already have the IOD some distance away.
But isn't the latency hit from the transition to an MCM approach due to the fact that a monolithic design will always have better latency? Conversely, the current chiplet approach for their CPUs already took this latency hit, so I doubt using shorter, denser wires via the HP fanout would add even more latency than the current SerDes IFOPs, but I could be mistaken.
Both true, points taken. My point was that the fanout design as part of the MCDs was focused on bandwidth, whereas CPUs don't have anywhere near the same bandwidth needs, but latency is more important.
Edit: Haha, just saw @maddie's post. Yes, that point precisely.
Just look at how close the dies are placed to each other on MI300 and N31.
Firstly, this involves a big production cost increase because of more expensive packaging, as well as more die area for the CCDs due to the additional ports.
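One way to see the port/area argument is to count links: a star topology through the IOD needs one IFOP port per CCD, while direct CCD-to-CCD links grow quadratically. A small sketch; the CCD counts are just examples, not a statement about specific SKUs:

```python
# Extra links/ports needed if every CCD also talked directly to every other CCD,
# on top of the existing one-IFOP-per-CCD star through the IOD.
# CCD counts are illustrative only.

def star_links(n_ccds: int) -> int:
    return n_ccds                        # one CCD<->IOD link per chiplet

def extra_direct_links(n_ccds: int) -> int:
    return n_ccds * (n_ccds - 1) // 2    # one additional link per CCD pair

for n in (2, 8, 12):
    print(f"{n} CCDs: {star_links(n)} star links + {extra_direct_links(n)} direct links, "
          f"{n - 1} extra port(s) per CCD")
# 2 CCDs: +1 direct link; 8 CCDs: +28; 12 CCDs: +66, with 11 extra ports per CCD
```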
They don't get to increase IFOP performance, or any sort of chip to chip communications, to a significant degree until they start to stack chips.
What?
If I had a farm, I would bet it on Infinity Fan-out Links, i.e. InFO-RDL, having at least the same, if not significantly lower, latency than IFOP. If you meant that the latency is worse than with an IMC, then yes, the monolithic design would be in front.
Zen 2-4 fabric topology is much more complex, unlike the RDNA 3 "star", especially taking multi-CPU configs into account. Even the client-side CPU uses 4 layers just for IFOP routing. Add to this power RDLs, interface route layers, etc.
The RDL fanout connection is parallel, unlike SerDes which is serial. So the latency should be lower with a parallel link.
RDL fanout links are just signal links; you can still serialize the data through them if you want. And usually there is a higher-level encoding and error detection/correction scheme on top as well.
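To make the parallel-vs-serial latency point concrete, here is a hedged sketch of the time to push one cache line across a narrow serialized link versus a wide fan-out bus; every number in it (lane counts, per-wire rates, PHY overheads) is an illustrative assumption, not a measured IFOP or InFO figure:

```python
# Time to move a 64-byte cache line across a narrow serialized link versus a wide
# parallel fan-out bus. All figures below are illustrative assumptions only.

def transfer_ns(payload_bytes: float, wires: int, gbit_per_wire: float,
                fixed_phy_ns: float) -> float:
    serialize_ns = payload_bytes * 8 / (wires * gbit_per_wire)  # Gbit/s == bits/ns
    return serialize_ns + fixed_phy_ns

cache_line = 64
# Narrow SerDes-style link: few wires, very high per-wire rate, more PHY latency.
print(transfer_ns(cache_line, wires=16, gbit_per_wire=32, fixed_phy_ns=5.0))   # ~6.0 ns
# Wide parallel fan-out bus: many slower wires, little extra PHY latency.
print(transfer_ns(cache_line, wires=512, gbit_per_wire=2, fixed_phy_ns=1.0))   # ~1.5 ns
# With these assumed numbers the gap comes mostly from the SerDes PHY overhead,
# not the raw serialization time, which is the point being argued above.
```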