Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it is looking like MI300 is not far off!
Usually AMD takes around three quarters to land support in LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe that is because the US Govt is starting to prepare the SW environment for El Capitan (maybe to avoid a slow bring-up like Frontier's, for example).

See here for the GFX940-specific commits:

History for llvm/lib/Target/AMDGPU - llvm/llvm-project (github.com)

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix (www.phoronix.com)

There is a lot more if you know whom to follow in the LLVM review chains (before it gets merged on GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of not having a host CPU capable of PCIe 5 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

inquiss · Feb 23, 2025

CastleBravo said:
It is only a clean win over the 4070 Ti Super if FSR4 upscaling matches DLSS4.

IMO, the right play for AMD is an "msrp" of $550-600, and an initial retail price of $700-750 for most AIB models for as long as the 5070 Ti is unobtanium.

Exactly. The latest feature needs to be the same, right? Always a new feature. Nvidia is really inferior to AMD on Radeon Chill though. Maybe Nvidia cards should be discounted in laptops if performance is equal?

inquiss · Feb 23, 2025

gdansk said:
The feature which reduces two different $1600 Nvidia GPUs to ~17ms frametimes at 1920x1080? For some reason I'm not too worried about needing to turn that crap on. Even in the near future.

Exactly. Very sensible take. Just the next goalpost shift for Nvidia fans to justify their purchase when it may be technically stronger but you'd never use the feature for a few generations. Just like the first RTX.

adroc_thurston · Feb 23, 2025

gaav87 said:
Crazy take: glued N48 on GDDR6, plus glued defective dies with cut-down PHYs, also on GDDR6. Possible?

Glued on what.

inquiss said:
Exactly. Very sensible take. Just the next goalpost shift for Nvidia fans to justify their purchase when it may be technically stronger but you'd never use the feature for a few generations. Just like the first RTX.

Whatever.

In2Photos · Feb 23, 2025

adroc_thurston said:
they need a $3k SRP target halo lol.

they're not guesses.
N4 is like $15.5k a wafer.

So you're guessing? Otherwise you'd have an exact number.

adroc_thurston · Feb 23, 2025

In2Photos said:
Otherwise you'd have an exact number.

It's an exact number.
N4 is $15.5k per wafer.

gaav87 · Feb 23, 2025

adroc_thurston said:
Glued on what.

Whatever.

I think they will connect two N48 on the same package and create a 128 CU GDDR6 monster.
Software already supports this with more than 2 micro engine schedulers, for example.
What they need:
1. High-bandwidth interconnect with a minimum of 1280 GB/s; RDNA3's IF already supports ~883 GB/s per MCD
2. Memory system: a unified memory pool mapped across both chiplets in firmware, or keep the memory local to each die with a fast interconnect mapped to each pool
3. Scheduling: already supported with more than 1 micro engine scheduler in the kernel
4. Cache coherence: would be hurt by the increased latency; that could be minimized by cross-chiplet cache traffic, but I remember reading something about this in the kernel as well

Still, even if it scaled +60% vs the 9070XT they would have a winner.

gaav87 · Feb 23, 2025

In2Photos said:
So you're guessing? Otherwise you'd have an exact number.

He's not guessing.

gdansk · Feb 23, 2025

gaav87 said:
1. High-bandwidth interconnect with a minimum of 1280 GB/s; RDNA3's IF already supports ~883 GB/s per MCD

and where is it on N48? 🤔

adroc_thurston · Feb 23, 2025

gaav87 said:
I think they will connect two N48 on the same package and create a 128 CU GDDR6 monster.
Software already supports this with more than 2 micro engine schedulers, for example.
What they need:
1. High-bandwidth interconnect with a minimum of 1280 GB/s; RDNA3's IF already supports ~883 GB/s per MCD
2. Memory system: a unified memory pool mapped across both chiplets in firmware, or keep the memory local to each die with a fast interconnect mapped to each pool
3. Scheduling: already supported with more than 1 micro engine scheduler in the kernel
4. Cache coherence: would be hurt by the increased latency; that could be minimized by cross-chiplet cache traffic, but I remember reading something about this in the kernel as well

Still, even if it scaled +60% vs the 9070XT they would have a winner.

It doesn't exist. It doesn't work.
N48 is a boring product made for boring reasons.

gaav87 · Feb 23, 2025

gdansk said:
and where is it on N48? 🤔

We will know on the 28th.

MrTeal · Feb 23, 2025

gaav87 said:
I think they will connect two N48 on the same package and create a 128 CU GDDR6 monster.
Software already supports this with more than 2 micro engine schedulers, for example.
What they need:
1. High-bandwidth interconnect with a minimum of 1280 GB/s; RDNA3's IF already supports ~883 GB/s per MCD
2. Memory system: a unified memory pool mapped across both chiplets in firmware, or keep the memory local to each die with a fast interconnect mapped to each pool
3. Scheduling: already supported with more than 1 micro engine scheduler in the kernel
4. Cache coherence: would be hurt by the increased latency; that could be minimized by cross-chiplet cache traffic, but I remember reading something about this in the kernel as well

Still, even if it scaled +60% vs the 9070XT they would have a winner.

That's a lot of ifs for stuff that isn't baked into N48 already.

It's going to be a lot easier for them to just make a 6-700mm² 128 CU die if that's what they want than screw around with dual GCDs. Nvidia's hand was forced with Blackwell because it's impossible to manufacture a 1600mm² die, and even with Nvidia's resources they've had manufacturing issues getting B200 out the door.

igor_kavinski · Feb 23, 2025

adroc_thurston said:
N48 is a boring product made for boring reasons.

Concur. Hope the B770 comes with 24GB VRAM and sells cheaper than 9070 XT.

adroc_thurston · Feb 23, 2025

MrTeal said:
It's going to be a lot easier for them to just make a 6-700mm² 128 CU die

128 CUs will be 500mm² or thereabouts.
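For a rough sense of what the numbers quoted in this thread imply, here is a back-of-the-envelope sketch. Only the $15.5k N4 wafer price and the die areas (about 500mm², and 650mm² as the midpoint of "6-700mm²") come from the posts above. The 300 mm wafer size and the textbook gross-dies-per-wafer approximation are assumptions, and yield, scribe lines and edge exclusion are ignored, so treat the result as a floor on cost rather than a real figure.

```python
import math

WAFER_DIAMETER_MM = 300.0   # assumption: standard 300 mm wafer
WAFER_PRICE_USD = 15_500.0  # N4 wafer price as quoted in the thread


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=WAFER_DIAMETER_MM):
    """Textbook approximation: wafer area / die area minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


def cost_per_gross_die(die_area_mm2, wafer_price_usd=WAFER_PRICE_USD):
    """Wafer price split evenly across gross dies (no yield loss modelled)."""
    return wafer_price_usd / gross_dies_per_wafer(die_area_mm2)


if __name__ == "__main__":
    for area in (500, 650):
        dies = gross_dies_per_wafer(area)
        cost = cost_per_gross_die(area)
        print(f"{area} mm^2: {dies} gross dies, ~${cost:.0f} per gross die")
```

Real per-die cost would be higher once defect density and packaging are included; the sketch only shows how quickly the die count drops as area grows.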

poke01 · Feb 23, 2025

adroc_thurston said:
5 even moreso.

Isn’t the Xbox handheld using 5?

Is 5 coming next year?

adroc_thurston · Feb 23, 2025

poke01 said:
Isn’t the Xbox handheld using 5?

No idea. Don't care either. Xbox has no games.

poke01 said:
Is 5 coming next year

H2'26 yes, they're 8q-ish cadence.

Josh128 · Feb 23, 2025

marees said:
So what are the guesses / explanations on AMD achieving 60+% increase over the 7800xt despite being on the same process node 🤔

More cache !!??

https://twitter.com/x/status/1304489928995344384

They didn't achieve 60%+ over the 7800XT. AMD's own numbers show +51% over the 6900XT, and almost half of the test results cited to obtain that number have RT on. TPU's 7800XT reviews show the 6900XT 3% faster in 4K raster but the 7800XT 3% faster in RT, which, when combined, effectively makes 7800XT = 6900XT. Therefore, AMD's own numbers indicate that the 9070XT is ~+51% vs the 7800XT, not 60%+.
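The chain of ratios in that argument can be written out as a quick check. This is only a sketch of the arithmetic: the 3% and 51% figures are the ones quoted in the post, while weighting raster and RT equally and averaging them with a geometric mean are assumptions made here for illustration.

```python
import math


def geomean(ratios):
    """Geometric mean, the usual way to average performance ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# 7800XT performance relative to 6900XT, from the figures cited in the post.
raster_ratio = 1 / 1.03  # 6900XT is 3% faster in 4K raster
rt_ratio = 1.03          # 7800XT is 3% faster in RT

# Assumption: raster and RT results are weighted equally.
r7800xt_vs_6900xt = geomean([raster_ratio, rt_ratio])

# 9070XT relative to 6900XT, as quoted in the post.
r9070xt_vs_6900xt = 1.51

# Chain the two ratios to get 9070XT relative to 7800XT.
r9070xt_vs_7800xt = r9070xt_vs_6900xt / r7800xt_vs_6900xt

if __name__ == "__main__":
    print(f"7800XT vs 6900XT: {r7800xt_vs_6900xt:.3f}x")
    print(f"9070XT vs 7800XT: {r9070xt_vs_7800xt:.3f}x")
```

With a different raster/RT weighting the first ratio moves away from parity, and the final figure moves with it.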

HuesToo4 · Feb 23, 2025

adroc_thurston said:
128 CUs will be 500mm² or thereabouts.

Why would they flip N44 once for N48 but not twice into another die if area was going to be around 500mm2 for 128CU?

adroc_thurston · Feb 23, 2025

HuesToo4 said:
Why would they flip N44 once for N48 but not twice into another die if area was going to be around 500mm2 for 128CU?

Because everything above N48 was a chthonic SoIC monstrosity.

gaav87 · Feb 23, 2025

adroc_thurston said:
Because everything above N48 was a chthonic SoIC monstrosity.

So they tried vertically stacking N48? Interesting.

gaav87 · Feb 23, 2025

Can't wait for the 28th and die shots to see if there are any leftovers xD

adroc_thurston · Feb 23, 2025

gaav87 said:
So they tried vertically stacking N48? Interesting.

No. A completely different design, wholly unrelated to N44/48.
These two are super basic volume fillers for poor people.

reaperrr3 · Feb 23, 2025

igor_kavinski said:
Concur. Hope the B770 comes with 24GB VRAM and sells cheaper than 9070 XT.

If B770 comes out at all, it'll be 16GB, because all rumors are pointing to a 256bit GDDR6 mem interface.
32 GB variants for AI market are possible, but those will be priced up accordingly.

And of course it'll (have to) be cheaper than the 9070 XT, since it'll be much slower.

G31 has only 60% more EUs than B580, only ~35% more bandwidth and probably somewhat lower clocks, too.
B580 is only slightly faster than 7600XT and already CPU-bound in many games (sometimes even at 1440p), so B770 will probably struggle to beat even the 7800XT in raster consistently.

They'd probably have to price a B770-16GB at $399 if they want it to be competitive in perf/$ vs. the 9070 and 5070.
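The bandwidth figure can be sanity-checked from bus width and memory speed. In the sketch below, the B580's 192-bit bus at 19 Gbps is its shipping spec and the 256-bit bus is the rumor cited above. Running the rumored part at the same 19 Gbps is an assumption made here, not something from the post.

```python
def bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8


b580 = bandwidth_gbs(192, 19)  # B580: 192-bit GDDR6 at 19 Gbps
g31 = bandwidth_gbs(256, 19)   # rumored 256-bit bus; 19 Gbps is an assumption

uplift = g31 / b580 - 1

if __name__ == "__main__":
    print(f"B580: {b580:.0f} GB/s, 256-bit part: {g31:.0f} GB/s, uplift: {uplift:.0%}")
```

At equal memory speed the wider bus gives about a third more bandwidth, which is in line with the "~35%" above; faster GDDR6 would push it higher.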

HuesToo4 said:
Why would they flip N44 once for N48 but not twice into another die if area was going to be around 500mm2 for 128CU?

The smallest chiplet-based SKU was probably cancelled long before the rest and replaced by N48.
But by the time they cancelled the bigger chiplet configs as well, it was too late from a time-to-market perspective to design another, bigger mono-N4x above N48 to replace them, because that would probably have come out a year too late and too close to the planned release time-frame of RDNA5.

uzzi38 · Feb 23, 2025

poke01 said:
Isn’t the Xbox handheld using 5?

Is 5 coming next year?

Probably not. It's supposed to be late this year, afaik.

PJVol · Feb 23, 2025

gaav87 said:
We will know on the 28th.

IIRC a monolithic ASIC has only GMI-type links (or whatever it's called in CDNA), unlike MCMs, which have additional bulky xGMI-like IFs to unify the on-chip DFs.
But anyway, it's not possible to "glue" them with just this interconnect. There are too many things that need to be shared in GPU at a lower abstraction level than DF.

poke01 · Feb 23, 2025

Also will FSR4 be open source?
