Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around 3Qs to get the support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (maybe to avoid a slow bring-up situation like Frontier, for example).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project (github.com)

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix (www.phoronix.com)

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of not having a host CPU capable of PCIe 5 in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

Vikv1918 · 2025-04-15T00:54:48-0400

Saylick said:
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements

Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional “ultra” settings.

chipsandcheese.com

Elden Ring is a strange game to test for this article. It's one of the worst RT implementations: performance tanks a lot for only minimal RT effects. It runs just as badly on RDNA4 as it does on RDNA3 and RDNA2, or even on Nvidia for that matter. In fact, if we look at the TechPowerUp benchmarks, RDNA4 runs worse than RDNA3 lol.

marees · 2025-04-15T02:34:00-0400

Saylick said:
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements

Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional “ultra” settings.

chipsandcheese.com

RDNA 4’s Raytracing Improvements

GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.
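For anyone wanting to check the quoted utilization numbers, here is a rough sketch in Python. It assumes utilization is the box test rate plus the triangle test rate, each divided by its own peak rate, i.e. that box and triangle tests occupy separate cycles. The Ray Accelerator counts (80 for the RX 6900 XT, 56 for the RX 9070) and the per-clock test rates are my assumptions, not figures from the excerpt. They were picked because they land on the quoted 10.23% and roughly 24% at 2.5 GHz, and a different per-clock rate for RDNA 4 would change the result.

```python
def ra_utilization(box_tests_per_s, tri_tests_per_s,
                   num_ras, clock_hz, box_per_clk, tri_per_clk):
    """Ballpark Ray Accelerator utilization as a fraction of peak.

    Hypothetical helper: assumes each RA spends a cycle on either box
    tests or triangle tests, so the two busy fractions simply add up.
    """
    peak_box = num_ras * box_per_clk * clock_hz   # peak box tests per second
    peak_tri = num_ras * tri_per_clk * clock_hz   # peak triangle tests per second
    return box_tests_per_s / peak_box + tri_tests_per_s / peak_tri


CLOCK_HZ = 2.5e9  # the 2.5 GHz assumption from the quoted article

# ASSUMED hardware parameters (not stated in the excerpt):
# RX 6900 XT: 80 RAs, 4 box tests and 1 triangle test per clock per RA
rdna2 = ra_utilization(38.8e9, 10.76e9, num_ras=80, clock_hz=CLOCK_HZ,
                       box_per_clk=4, tri_per_clk=1)
# RX 9070: 56 RAs, 8 box tests and 1 triangle test per clock per RA
rdna4 = ra_utilization(111.76e9, 19.61e9, num_ras=56, clock_hz=CLOCK_HZ,
                       box_per_clk=8, tri_per_clk=1)

print(f"RDNA 2: {rdna2:.2%}")
print(f"RDNA 4: {rdna4:.2%}")
```

Under those assumptions the two cards come out at about 10.2% and 24.0%, in line with the quote. Either way both are far from saturating the Ray Accelerators, which fits the memory-bound picture.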

eek2121 · 2025-04-15T07:27:00-0400

marees said:
RDNA 4’s Raytracing Improvements

GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.

I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps not support it at all on older hardware.

Regarding FSR4, the same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release, but I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc would not exist, and RT implementation would be significantly improved.

GodisanAtheist · 2025-04-15T11:36:43-0400

Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers: more efficient, more performant RT calculations. But would it allow for more effects as well?

Programmable shaders sort of made sense since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting right?

Seems like a very specific task to make fully programmable.

igor_kavinski · 2025-04-15T11:41:39-0400

GodisanAtheist said:
Seems like a very specific task to make fully programmable.

Maybe because full RT is insanely hard on compute so they want to control light rays per object as you may not want 100 light rays falling on something unimportant to gameplay and even the scene itself.

DisEnchantment · 2025-04-15T12:04:03-0400

igor_kavinski said:
Maybe because full RT is insanely hard on compute so they want to control light rays per object as you may not want 100 light rays falling on something unimportant to gameplay and even the scene itself.

Rather, it is insanely hard on the memory and cache subsystem. For client graphics, the memory and cache subsystem has not been evolving at the same rate as on the DC side.
It needs a lot of investment at all levels of the cache hierarchy. The stalls during BVH traversal are all memory bound.
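
To put a rough number on the pointer chasing: each BVH level is a dependent fetch, since the child address is only known once the parent node has arrived. Below is a minimal sketch in Python, assuming a perfectly balanced tree and ignoring caches entirely. The branching factors 4 and 8 and the one million leaf count are illustrative picks, not figures from this thread.

```python
def dependent_fetches(num_leaves, branching):
    """Number of levels (serial node fetches) from root to leaf in a
    perfectly balanced tree with the given branching factor."""
    levels = 0
    capacity = 1  # leaves reachable with the current number of levels
    while capacity < num_leaves:
        capacity *= branching
        levels += 1
    return levels


LEAVES = 1_000_000  # illustrative leaf count

for width in (4, 8):
    steps = dependent_fetches(LEAVES, width)
    print(f"{width}-wide nodes: {steps} dependent fetches per ray")
```

Wider nodes mean more box tests per node but fewer serial round trips to memory, which is the trade the C&C quote describes. Real trees are not balanced and caches absorb part of the cost, so treat this as an upper-level intuition only.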
