Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,754
6,631
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around three quarters to get support into LLVM and amdgpu. Lately (since RDNA2) the window in which they push support for new devices has been much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, this is a lot of commits. Maybe the US Government is starting to prepare the SW environment for El Capitan early (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940-specific commits
Or Phoronix

There is a lot more if you know whom to follow in the LLVM review chains (before things get merged to GitHub), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of no host CPU capable of PCIe 5 in the very near term, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts; the MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 
Last edited:

gaav87

Senior member
Apr 27, 2024
452
794
96
RDNA2 via DP4a?
No, RDNA2 will not be included: it does not have WMMA. It could work via FP16, but very slowly.
From what I heard, FSR4 uses an FP8 path on RDNA4 and, I speculate, an FP16 path on everything else (older AMD GPUs, Intel, Nvidia, consoles, APUs). Maybe there is an additional "block-FP16 = FP8 throughput" path for AMD APUs with an XDNA2 NPU (Strix Point & Halo).

FP8 and FP16 make sense to me. They are more accurate than INT8 (= fewer DNN parameters required, or higher DNN quality), and INT8 is a mess on AMD GPUs (e.g. N10 has a lower INT8 rate than its smaller siblings, PS5 is unclear, RDNA3 has the same INT8 rate as FP16, etc.); it would also get killed by Nvidia's Tensor Core INT8, which simply has much higher throughput. FP8 throughput on RDNA4 should be on a similar level to FP16 Tensor throughput on similarly sized Nvidia and Intel GPUs, whereas with INT8, RDNA2/3/4 would probably be slower than their Nvidia counterparts. Not a good idea.


It seems to me to be more DLSS-like, or like what Arm does on mobile (see their SIGGRAPH presentation from 2024): parameter prediction, with the rest very similar to FSR2/3. XeSS seems to rely on a much heavier DNN.

FSR4 precursor:
https://gpuopen.com/learn/neural_supersampling_and_denoising_for_real-time_path_tracing/
Yes, some features can work via WMMA, but slower than on RDNA4 (which added FP8 and INT4); RDNA4 also has SWMMA (sparsity).
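For intuition on the INT8 vs. FP8 accuracy argument quoted above, here is a minimal pure-Python sketch (my own illustration, nothing from FSR4 itself) comparing per-tensor INT8 quantization against a rough E4M3-style FP8 quantization, ignoring subnormals. FP8's floating exponent keeps relative error roughly constant, while INT8's fixed step swamps small weights:

```python
import math, random

def quant_int8(x, scale):
    """Per-tensor symmetric INT8: one fixed step size everywhere."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

def quant_fp8_e4m3(x):
    """Rough E4M3 simulation: 3 mantissa bits, max normal ~448.
    Subnormals ignored for brevity."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    y = round(x / step) * step
    return math.copysign(min(abs(y), 448.0), x)

random.seed(0)
# DNN-ish weights: mostly small values, plus a few large outliers
weights = [random.gauss(0, 0.05) for _ in range(10000)] + [1.5, -2.0]
scale = max(abs(w) for w in weights) / 127   # INT8 scale is set by the outliers

for name, f in [("INT8", lambda w: quant_int8(w, scale)),
                ("FP8 ", quant_fp8_e4m3)]:
    rel = [abs(f(w) - w) / abs(w) for w in weights if abs(w) > 1e-6]
    print(name, "mean relative error: %.3f" % (sum(rel) / len(rel)))
```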
 
Last edited:

Win2012R2

Senior member
Dec 5, 2024
647
609
96
It's what my brain does that lets me see.
Then you must wait for the GeForce 6666 series - it will include a new high-speed cable that connects* right into yer cerebral cortex. I can't disclose where the mass-market 6660 would connect, but it's pretty far from where the brain is for most people. Though that is currently a subject of internal discussion - perhaps that is where the brain actually is, on average...


* actual drilling operation is optional extra
 

basix

Member
Oct 4, 2024
41
75
51
At least all vendors support it (Intel, Nvidia, AMD). Currently, INT4 is rarely used, but with potentially upcoming ternary ML models (weights = [-1, 0, 1]) it might get more popular (https://arxiv.org/abs/2402.17764). Intel even supports INT2, which could be even better for ternary weights as long as you keep the INT4 accumulate output (INT2 is not sufficient for representing 1 + 1 = 2).
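For concreteness, a minimal sketch of the absmean ternary quantization the linked paper proposes (plain Python, heavily simplified; the real scheme runs per weight matrix during training):

```python
# Absmean ternarization per arXiv:2402.17764 (roughly): scale by the mean
# absolute weight, then round and clip every weight to {-1, 0, +1}.

def ternarize(weights, eps=1e-8):
    gamma = sum(abs(w) for w in weights) / len(weights)   # mean |W|
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma   # keep gamma to rescale outputs after the matmul

w = [0.31, -0.02, -0.77, 1.10, 0.05]
q, gamma = ternarize(w)
print(q)       # [1, 0, -1, 1, 0] -- only -1/0/+1 survive
print(gamma)   # per-tensor scale, reapplied to the matmul result
```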

Why is ternary likely to happen: Money
 

adroc_thurston

Diamond Member
Jul 2, 2023
4,714
6,501
96
Currently, INT4 is rarely used but with potentially upcoming Ternary ML-Models (weights = [-1, 0, 1]) it might get more popular
It literally went poof from NV GPUs starting with H100.
Now, FP4/FP6 microscaled? Yeah, maybe.

Please refrain from graphic language in the tech section. -Moderator Shmee
 
Last edited by a moderator:

basix

Member
Oct 4, 2024
41
75
51
Yes. Currently, FP8 is widely used. FP4 with microscaling makes it better.

Ternary as proposed in the linked paper has another big, big benefit: you get away with using solely adders. No multiplier necessary. That makes the HW much simpler, smaller, and more energy efficient. This is not a topic for today's HW, but maybe for future HW generations or specialized accelerators. Think Microsoft in-house silicon, e.g. specialized CDNA5 chiplets with ternary inferencing & training as the sole use case (ternary meaning that you also train in ternary), or maybe an addition to XDNA3.
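A toy sketch of the adders-only point (my own illustration, not any shipping kernel): with weights restricted to {-1, 0, +1}, every multiply in a matrix-vector product degenerates into an add, a subtract, or a skip:

```python
# Ternary matrix-vector product using only adds and subtracts --
# no multiplier is ever needed, and zero weights cost nothing.

def ternary_matvec(W, x):
    """W: rows of ternary weights in {-1, 0, 1}; x: activation vector."""
    out = []
    for row in W:
        acc = 0   # wide accumulator (think INT32 in hardware)
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0: skip entirely
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [10, 20, 30]
print(ternary_matvec(W, x))   # [10 - 30, -10 + 20 + 30] -> [-20, 40]
```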

But I think we are drifting away from the RDNA4 topic
 
Last edited:

gaav87

Senior member
Apr 27, 2024
452
794
96
int4 isn't usable for anything anywhere anyway. Forget about it.
For ML (FSR4).

INT4 WMMA already exists on RDNA3. The only question is its rate: same as INT8/FP8, or double.
https://gpuopen.com/learn/wmma_on_rdna3/
https://www.amd.com/content/dam/amd...r-instruction-set-architecture-feb-2023_0.pdf -> chapter 7.9
WMMA on RDNA4 (INT4):
16x16x32, INT4 -> INT32
vs. RDNA3's 16x16x16, INT4 -> INT32
and with SWMMA (sparsity) we get 16x16x64, INT4 -> INT32
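To make those shapes concrete, here is a plain-Python reference of what one 16x16xK integer WMMA computes, plus the MAC count per instruction at each K (my own emulation, not the actual ISA intrinsics):

```python
import random

# Reference emulation of a 16x16xK integer WMMA tile operation:
#   C[16][16] (INT32) += A[16][K] (INT4) * B[K][16] (INT4)

def wmma_ref(A, B, C):
    K = len(A[0])
    for i in range(16):
        for j in range(16):
            acc = C[i][j]                    # INT32 accumulator
            for k in range(K):
                acc += A[i][k] * B[k][j]     # INT4 x INT4 products
            C[i][j] = acc
    return C

random.seed(1)
K = 32                                       # RDNA4 INT4 WMMA depth
A = [[random.randint(-8, 7) for _ in range(K)] for _ in range(16)]
B = [[random.randint(-8, 7) for _ in range(16)] for _ in range(K)]
C = wmma_ref(A, B, [[0] * 16 for _ in range(16)])

# K doubles from RDNA3 to RDNA4, doubling the INT4 rate per instruction;
# sparse SWMMA doubles K again (on structured-sparse data only).
for name, k in [("RDNA3  WMMA 16x16x16", 16),
                ("RDNA4  WMMA 16x16x32", 32),
                ("RDNA4 SWMMA 16x16x64", 64)]:
    print(name, "->", 16 * 16 * k, "MACs/instr")
```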
 
Reactions: Tlh97 and basix