Discussion RDNA 5 / UDNA (CDNA Next) speculation


adroc_thurston

Diamond Member
Jul 2, 2023
5,438
7,609
96
Why jump straight to something that complex and expensive
They're good at it.
Assuming RDNA 5 uses chiplets
It doesn't. They don't have the manpower for that anymore.
3D stack the GCD on the MCD and you have a low-end product. No expensive/complex interposer or bridge packaging needed. Just two dies like Zen X3D. Then combine two and three stacks for the mid-range and high-end products. Those would require some sort of interposer or active bridge. The product stack would look like this:
That's just N4C with even more tapeouts.
No bueno.
 
Reactions: marees

GTracing

Senior member
Aug 6, 2021
459
1,059
106
I went back and looked at the Navi 4C designs. Why jump straight to something that complex and expensive? Assuming RDNA 5 uses chiplets, why not go for a simpler layout? This is what I’m thinking:

GCD - 40 CU (80-100mm2)
MCD - 128-bit bus + 32 MB Infinity Cache (~70mm2)

3D stack the GCD on the MCD and you have a low-end product. No expensive/complex interposer or bridge packaging needed. Just two dies like Zen X3D. Then combine two and three stacks for the mid-range and high-end products. Those would require some sort of interposer or active bridge. The product stack would look like this:

Peasants - 40 CU, 128-bit bus, 32 MB cache, 2 dies (150-170mm2)

Plebeians - 80 CU, 256-bit bus, 64 MB cache, 4 dies (300-340mm2)

Whales - 120 CU, 384-bit bus, 96 MB cache, 6 dies (450-510mm2)

That’s pretty conservative in terms of silicon and covers most of the product stack with two dies and less advanced packaging than what they were working on. Assuming N2X, the low-end version might approach a 9700 XT while the high-end would be at least twice as fast. They could maybe fit 60 CU, but it’s hard to say without knowing how much bigger RDNA 5 CUs might be, or whether they use N3P vs N2X.
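As a sanity check, the arithmetic behind that quoted stack is easy to reproduce. A minimal sketch, using only the proposal's own figures (40 CU GCD, 128-bit/32 MB MCD) and assuming nothing about actual RDNA 5 CU sizes:

```python
# Back-of-envelope for the proposed GCD + MCD stack quoted above.
# All inputs are the proposal's assumptions, not confirmed RDNA 5 specs.

GCD = {"cu": 40, "area_mm2": (80, 100)}                        # compute die
MCD = {"bus_bits": 128, "cache_mb": 32, "area_mm2": (70, 70)}  # cache + memory PHY die

def tier(name, stacks):
    """One product tier = `stacks` GCD-on-MCD pairs (two dies per stack)."""
    lo = stacks * (GCD["area_mm2"][0] + MCD["area_mm2"][0])
    hi = stacks * (GCD["area_mm2"][1] + MCD["area_mm2"][1])
    return (name, stacks * GCD["cu"], stacks * MCD["bus_bits"],
            stacks * MCD["cache_mb"], stacks * 2, f"{lo}-{hi} mm2")

for row in (tier("Peasants", 1), tier("Plebeians", 2), tier("Whales", 3)):
    print(row)
# ('Peasants', 40, 128, 32, 2, '150-170 mm2')
# ('Plebeians', 80, 256, 64, 4, '300-340 mm2')
# ('Whales', 120, 384, 96, 6, '450-510 mm2')
```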
There are a few things that make it more complicated than that. Just to cover the three biggest:

Firstly, there are things like the media engine, display engine, and command processor that you don't need or want multiples of.

Secondly, in that arrangement the display PHYs would be in the top die in a stack. I don't think that's possible.

Thirdly, each CU needs access to all the memory. You would need interconnects with crazy amounts of bandwidth, which would almost certainly be expensive/complex.
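To put a rough number on that third point: if addresses are interleaved evenly across the stacks, most accesses land on a remote die. A back-of-envelope sketch, with GDDR7-class per-pin rates and the three-stack configuration assumed purely for illustration:

```python
# Rough cross-die traffic estimate for the three-stack configuration above.
# Assumptions (mine, for illustration): GDDR7 at ~32 Gbps/pin, addresses
# interleaved evenly across the three 128-bit MCD stacks.

bus_bits_total = 384
gbps_per_pin = 32
dram_bw_gbs = bus_bits_total * gbps_per_pin / 8   # ~1536 GB/s aggregate

stacks = 3
# With even interleaving, a CU finds only 1/stacks of its data under its
# own MCD; the rest has to cross a die-to-die link.
offdie_fraction = 1 - 1 / stacks
cross_die_gbs = dram_bw_gbs * offdie_fraction

print(f"aggregate DRAM bandwidth: {dram_bw_gbs:.0f} GB/s")
print(f"traffic crossing links:   {cross_die_gbs:.0f} GB/s (~{cross_die_gbs / 1000:.1f} TB/s)")
# And that is DRAM-side traffic only; Infinity Cache and any shared-L2 hits
# that land on another die would come on top of it.
```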
 

Kronos1996

Member
Dec 28, 2022
58
98
61
Poor bastards. Unfortunately, consumer and server GPUs have diverged to the point where they have little in common. Sharing dies between product stacks isn’t possible in the way it is on the CPU side. I do wonder about sharing package designs though.

Navi 4C is pretty similar to MI300. If they designed both to use the same packaging, that may help with R&D costs. Having the option to use HBM MCDs from Instinct on professional RDNA cards would be especially advantageous.
 

Kronos1996

Member
Dec 28, 2022
58
98
61
oh hell no, it was much more complex.

doesn't have dat, the IMC is always on AID.
MI400 then. Just theorizing about why they’d use such a complex package for consumer products. Perhaps they’re trying to reuse a package design made for server? Yeah it might cost more but the R&D was already paid for.

Would also open up some options for die sharing in certain markets. Slapping RDNA GCD’s on top of CDNA AID’s (or whatever stupid name they call them now) would make one helluva workstation product. HBM would be a decisive advantage in the AI war.
 

adroc_thurston

Diamond Member
Jul 2, 2023
5,438
7,609
96
Just theorizing about why they’d use such a complex package for consumer products.
a) sometimes you gotta win.
b) they have somewhat of a fetish for advanced packaging. Remember, they tried to force HBM into client.
Slapping RDNA GCDs on top of CDNA AIDs (or whatever stupid name they call them now) would make one helluva workstation product. HBM would be a decisive advantage in the AI war.
none of that is product level viable.
Chiplets in gfx are a "win more" option.
 

basix

Member
Oct 4, 2024
89
179
66
The "simplest" Multi-Chip approach is that what Nvidia does with B100/200: Just fuse two GPUs together. That makes a ~300...350mm2 GPU a little bit bigger (you need to have an additional chip-to-chip interface) but does not add other cost. Smaller Die of the same portfolio are not affected as well. The big GPU, which uses two of the chiplets (600...700mm2 total) would have doubled media engines etc. but for a Halo part not a huge problem. You can also put those in use for prosumers etc. (e.g. GB202 has more video encoders/decoders compared to the the smaller Blackwell GPUs).

Packaging cost should also not increase by too much. I think RDNA3-like organic "Infinity Fanout Links" are enough for a few TByte/s of bandwidth. For VRAM and Infinity Cache bandwidth, surely enough. The question is what happens with the L2$. I assume it needs to be private to each chiplet in some form; I do not see how you could route much, if not all, L2$ traffic over the chip-to-chip interface and keep the cost reasonable.
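A crude way to see why the L2$ likely has to stay private: compare what an organic fanout link can plausibly carry against L2-level versus memory-level traffic. The figures below are illustrative assumptions only (a few TB/s for the links, roughly an order of magnitude more for a big GPU's aggregate L2 bandwidth), not measured numbers:

```python
# Crude bandwidth comparison for the split point. Illustrative assumptions,
# not measured figures.

l2_bw_tbs = 10.0      # aggregate L2 bandwidth of a large GPU, order of magnitude
fanout_bw_tbs = 5.0   # RDNA3-class organic fanout links, a few TB/s in total
vram_bw_tbs = 1.5     # 384-bit GDDR-class VRAM, roughly 1-1.5 TB/s

print(f"a shared L2 would need ~{l2_bw_tbs / fanout_bw_tbs:.1f}x the fanout capacity")
print(f"VRAM-side traffic uses only ~{vram_bw_tbs / fanout_bw_tbs:.0%} of it")
# Memory-side traffic fits comfortably over the links; shared-L2 traffic
# does not, hence a private L2 per chiplet looks like the natural cut.
```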
 

branch_suggestion

Senior member
Aug 4, 2023
641
1,358
96
The "simplest" Multi-Chip approach is that what Nvidia does with B100/200: Just fuse two GPUs together.
Easy for compute, graphics is a whole other story due to the serial nature of APIs.
Also the bigger you go, the demands on interconnect bandwidth go up exponentially.
That makes a ~300...350mm2 GPU a little bit bigger (you need to have an additional chip-to-chip interface) but does not add other cost. Smaller Die of the same portfolio are not affected as well. The big GPU, which uses two of the chiplets (600...700mm2 total) would have doubled media engines etc. but for a Halo part not a huge problem. You can also put those in use for prosumers etc. (e.g. GB202 has more video encoders/decoders compared to the the smaller Blackwell GPUs).
Thing is you need a big, expensive and high demand substrate, and you could just build a big enough monolithic part.
Packaging cost should also not increase by too much. I think RDNA3 alike organic "Infinity Fanout Links" are enough for a few TByte/s of bandwidth.
Not between compute engines, you need CoWoS-L for enough bandwidth to sync L3+memory ~5TB/s bidirectional. L2, forget about it, you would need to have 2 GPUs work together as one through all sorts of tricks.
CoWoS is a nonstarter for client, same reasons as HBM.
Much easier to build one big compute engine and stack it on top of cache and memory PHYs.
For VRAM and Infinity Cache bandwidth, surely enough. The question is what happens with the L2$. I assume it needs to be private to each chiplet in some form; I do not see how you could route much, if not all, L2$ traffic over the chip-to-chip interface and keep the cost reasonable.
Well, N4C had the L2 private to each SED; even with SoIC-X, having a coherent L2 across multiple chiplets is a very tall ask.
So once again, the best idea without going beyond reticle-scale base dimensions is a simple 3D stack: frontend+compute+L2 up top, L3+memory below.
And I guess a MID or two connected to the base with fanouts for IO. The thing is expensive enough as is.
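Rough arithmetic on why the vertical split works where a planar one struggles: hybrid bonding packs far more die-to-die connections per unit area than organic fanout bumps. The pitches below are ballpark assumptions, not vendor specifications:

```python
# Connection-density arithmetic for why "stack compute on cache + PHYs"
# dodges the link-bandwidth problem. Pitches are ballpark assumptions.

def pads_per_mm2(pitch_um):
    """Die-to-die bond pads per mm^2 on a square grid at the given pitch."""
    return (1000 / pitch_um) ** 2

hybrid_bond_pitch_um = 9     # SoIC-class hybrid bonding, order of magnitude
fanout_bump_pitch_um = 35    # fine-pitch organic fanout bumps, ballpark

ratio = pads_per_mm2(hybrid_bond_pitch_um) / pads_per_mm2(fanout_bump_pitch_um)
print(f"~{ratio:.0f}x more connections per mm^2 with hybrid bonding")
# At similar per-wire signalling rates that ratio is also the bandwidth-per-
# area advantage, which is why the vertical L2/L3 interface can be treated
# as nearly free while a planar link between two compute dies cannot.
```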
 
Reactions: Tlh97

Kronos1996

Member
Dec 28, 2022
58
98
61
I understand they gotta prioritize server, but abandoning chiplet consumer GPUs may be foolish. I assumed the plan was to share GPU chiplets with laptop. Future laptop chips could then share CPU and GPU dies with desktop. The only custom die would be an IOD of some sort, probably with the GPU chiplet 3D-stacked on top. They could iterate a lot faster and wouldn’t have to redo as much work every year. Just make a new IOD each time.

If OEMs still insist on an annual cadence, maybe they do a mid-gen refresh chiplet for the CPU and GPU, Zen 6+ and RDNA 5.5 for example. They could also use those on desktop for an annual release cadence. With the introduction of Halo, AMD is perfectly positioned to dominate this new premium segment. That would be easier if they could share GPU chiplets with desktop though.
 

Vikv1918

Junior Member
Mar 12, 2025
11
23
36
I understand they gotta prioritize server, but abandoning chiplet consumer GPUs may be foolish. I assumed the plan was to share GPU chiplets with laptop. Future laptop chips could then share CPU and GPU dies with desktop. The only custom die would be an IOD of some sort, probably with the GPU chiplet 3D-stacked on top. They could iterate a lot faster and wouldn’t have to redo as much work every year. Just make a new IOD each time.

If OEMs still insist on an annual cadence, maybe they do a mid-gen refresh chiplet for the CPU and GPU, Zen 6+ and RDNA 5.5 for example. They could also use those on desktop for an annual release cadence. With the introduction of Halo, AMD is perfectly positioned to dominate this new premium segment. That would be easier if they could share GPU chiplets with desktop though.
Maybe AMD is saving chiplets as a "break glass in case of emergency" plan if Nvidia gets too competitive. Right now, they don't need it as their monolithic architecture is good enough. In laptop especially they have zero ambition and don't care that they have 0.01% market share, so laptops don't matter enough to design their GPU around them. They've already secured PS6 and nextbox so they have guaranteed revenue to chug along at least for the next 5-7 years.
 

UsedTweaker

Junior Member
Apr 3, 2025
1
1
6
The only way GPU chiplets work today is with ultra-high-bandwidth packaging, aka the stuff AI demand looks like it will keep booked out straight through RDNA5's ship date (at TSMC anyway). The RDNA3 chiplet strategy wouldn't offer much: the PHYs and SRAM on N48 are <20% of the die area, so hiving off a ~70mm2 chiplet doesn't save much $$. Plus that adds latency going out to the big cache/main memory, and GPUs aren't as good at hiding latency as you'd expect.

You'd need packaging that could split the logic of the next chip in half or something for it to matter much. Two 150mm2 dies yield noticeably better than one 300mm2 die.

Maybe the packaging could be something cheap though. Moving GDDR on-package instead of soldered on the board would give a latency improvement of a couple tens of ns. That matters, especially in raytracing, and you could plausibly swing having AIBs pay for it today and keep up the financial trickery of pretending you're only selling the dies so your profit margins look bigger.
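The yield point checks out under the usual defect-density model. A minimal sketch using a Poisson yield model and an assumed 0.1 defects/cm2 (illustrative, not a known foundry figure):

```python
import math

# Poisson yield model for the 2 x 150mm2 vs 1 x 300mm2 comparison.
# Defect density is an assumed ballpark, not a known foundry figure.

def poisson_yield(area_mm2, d0_per_cm2=0.1):
    """Fraction of dies with zero killer defects."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)  # mm^2 -> cm^2

y_mono = poisson_yield(300)
y_chip = poisson_yield(150)
print(f"300 mm2 monolithic die yield: {y_mono:.1%}")   # ~74%
print(f"150 mm2 chiplet yield:        {y_chip:.1%}")   # ~86%
print(f"good silicon per wafer:       {y_chip / y_mono:.2f}x in the chiplet's favour")
# A defect also only scraps 150 mm2 instead of 300 mm2, and the two
# chiplets can be binned and paired independently.
```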
 
Reactions: Darkmont

adroc_thurston

Diamond Member
Jul 2, 2023
5,438
7,609
96
Plus that adds latency going out to the big cache/main memory, and GPUs aren't as good at hiding latency as you'd expect.
'latency' add is negligible and GPUs are very much good at hiding latency.
that's why they exist.
Maybe the packaging could be something cheap though. Moving GDDR on-package instead of soldered on the board would give a latency improvement of a couple tens of ns.
MoP does not do anything about latency.
That's not how DRAM works.
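For what it's worth, the latency-hiding point can be put in numbers with Little's law: concurrency = bandwidth x latency. A sketch with assumed figures (roughly 1 TB/s of DRAM bandwidth, ~400 ns base latency, a pessimistic +20 ns for crossing a package link):

```python
# Little's law view of a small latency add on a GPU. Figures are
# illustrative assumptions.

mem_bw_bytes_per_ns = 1000    # ~1 TB/s of DRAM bandwidth
request_bytes = 128           # one cache line per outstanding request

base_latency_ns = 400         # rough GDDR round trip as seen by the SIMDs
extra_latency_ns = 20         # pessimistic add for crossing a package link

def inflight(latency_ns):
    # concurrency = throughput * latency
    return mem_bw_bytes_per_ns * latency_ns / request_bytes

base = inflight(base_latency_ns)
extra = inflight(base_latency_ns + extra_latency_ns) - base
print(f"requests in flight to saturate DRAM today: ~{base:.0f}")
print(f"extra in-flight requests for +20 ns:       ~{extra:.0f} ({extra / base:.0%} more)")
# A GPU already keeps thousands of wavefronts resident to cover the base
# ~400 ns; a few percent more outstanding work is exactly the kind of thing
# it is built to absorb.
```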
 

Kronos1996

Member
Dec 28, 2022
58
98
61
The only way GPU chiplets work today is with ultra-high-bandwidth packaging, aka the stuff AI demand looks like it will keep booked out straight through RDNA5's ship date (at TSMC anyway). The RDNA3 chiplet strategy wouldn't offer much: the PHYs and SRAM on N48 are <20% of the die area, so hiving off a ~70mm2 chiplet doesn't save much $$. Plus that adds latency going out to the big cache/main memory, and GPUs aren't as good at hiding latency as you'd expect.

You'd need packaging that could split the logic of the next chip in half or something for it to matter much. Two 150mm2 dies yield noticeably better than one 300mm2 die.

Maybe the packaging could be something cheap though. Moving GDDR on-package instead of soldered on the board would give a latency improvement of a couple tens of ns. That matters, especially in raytracing, and you could plausibly swing having AIBs pay for it today and keep up the financial trickery of pretending you're only selling the dies so your profit margins look bigger.
I mean, Intel has separate GPU tiles on their mobile SoCs now. Other issues aside, EMIB seems quite affordable now that it’s at volume. I don’t think it’s far-fetched that AMD follows suit before long. Those super complex SoCs are apparently quite difficult to make. Separating more things out of the main die should make design easier.
 