Discussion RDNA4 + CDNA3 Architectures Thread


DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,754
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around 3 quarters to get support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But judging by the flurry of code in LLVM, that is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 

Vikv1918

Junior Member
Mar 12, 2025
10
19
36
C&C has an article on RDNA4’s RT improvements:
Elden Ring is a strange game to test for this article. It's one of the worst RT implementations: performance tanks a lot for only minimal RT effects. It runs just as badly on RDNA4 as it does on RDNA3 and RDNA2, or even on Nvidia for that matter. In fact, if we look at the TechPowerUp benchmarks, RDNA4 runs worse than RDNA3 lol.
 
Reactions: Racan

marees

Senior member
Apr 28, 2024
965
1,288
96
C&C has an article on RDNA4’s RT improvements:

RDNA 4’s Raytracing Improvements​


GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy.

https://chipsandcheese.com/p/rdna-4s-raytracing-improvements

In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.
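The utilization figures in the quote above can be roughly reproduced with a few lines of arithmetic. This is only a sketch under assumptions that are not in the quoted text: one Ray Accelerator per CU (56 on the RX 9070, 80 on the RX 6900 XT), and per-clock rates of 8 box or 1 triangle test for RDNA 4 and 4 box or 1 triangle test for RDNA 2. Those rates were picked because they land on roughly 24% and 10.23% at the quoted 2.5 GHz, not because the article states them. The function and parameter names are made up for illustration.

```python
def ra_utilization(box_tests_per_s, tri_tests_per_s, num_ras,
                   box_per_clk, tri_per_clk, clock_hz=2.5e9):
    """Fraction of Ray Accelerator cycles spent on box plus triangle tests.

    Assumes each Ray Accelerator issues either box tests or triangle
    tests in a given cycle, so the two fractions simply add up.
    """
    box_peak = num_ras * box_per_clk * clock_hz   # peak box tests per second
    tri_peak = num_ras * tri_per_clk * clock_hz   # peak triangle tests per second
    return box_tests_per_s / box_peak + tri_tests_per_s / tri_peak


# Measured rates come from the quoted article; RA counts and per-clock
# rates are assumptions (see the paragraph above).
rdna4 = ra_utilization(111.76e9, 19.61e9, num_ras=56, box_per_clk=8, tri_per_clk=1)
rdna2 = ra_utilization(38.8e9, 10.76e9, num_ras=80, box_per_clk=4, tri_per_clk=1)

print(f"RDNA 4 (RX 9070):    {rdna4:.2%}")
print(f"RDNA 2 (RX 6900 XT): {rdna2:.2%}")
```

If a different triangle rate is assumed for RDNA 4, the first figure changes accordingly, so treat the 24% as sensitive to that input.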

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies.
 
Reactions: Win2012R2

eek2121

Diamond Member
Aug 2, 2005
3,300
4,843
136

I’m actually a huge fan of the way AMD is approaching both RT and FSR. Rather than throwing tensor cores/fixed function hardware at the issue, they are simply expanding the capabilities of the architecture itself.

I haven’t paid close attention to Intel’s implementation of RT, but I think NVIDIA is the one that is doing it wrong. It is going to bite them in the rear end at some point. Quite a few devs want a fully programmable RT pipeline, and NVIDIA will be forced to do that in a very suboptimal way, or perhaps not support it at all on older hardware.

Regarding FSR4, the same hardware that powers it can be used for other things as well. We probably won’t see much until a PS6 release; however, I expect we will see some stuff in the future.

The real issue is, of course, Microsoft. They should be launching new versions of DirectX with new features on a regular basis and then using that as a carrot on a stick to help accelerate GPU development. If they had been leading the way, FSR4, DLSS, etc would not exist, and RT implementation would be significantly improved.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
7,851
8,931
136
Explain it to me like I'm 5: what would a fully programmable RT pipeline do? I'm guessing the usual answers (more efficient, more performant RT calculations), but would it allow for more effects as well?

Programmable shaders sort of made sense since shaders are used for basically every visual element in the scene, but a programmable RT pipeline... lighting is lighting, right?

Seems like a very specific task to make fully programmable.
 
Reactions: marees

DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,754
136
Maybe because full RT is insanely hard on compute so they want to control light rays per object as you may not want 100 light rays falling on something unimportant to gameplay and even the scene itself.

Rather, it is insanely hard on the memory and cache subsystem. For client graphics, the memory and cache subsystem has not been evolving at the same rate as on the data center (DC) side.
It needs a lot of investment at all levels of the cache hierarchy. The stalls during BVH traversal are all memory bound.
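To put a rough number on the memory-bound point, here is a toy sketch (not a model of any real GPU). It treats one ray walking a BVH from root to leaf as a chain of dependent node fetches, one per level, so the cost is about depth times miss latency. It also shows why wider nodes, i.e. trading pointer-chasing steps for more parallel box tests as in the C&C quote earlier in the thread, shorten that chain. The leaf count and the latency value are hypothetical, and the tree is assumed to be complete and every fetch assumed to miss in cache.

```python
def dependent_fetches(num_leaves, branching_factor):
    """Levels in a complete tree, i.e. node fetches one ray does back to back."""
    levels, capacity = 0, 1
    while capacity < num_leaves:
        capacity *= branching_factor   # each extra level multiplies reachable leaves
        levels += 1
    return levels


def chase_latency_ns(num_leaves, branching_factor, miss_latency_ns):
    """Worst-case pointer-chasing time if every node fetch misses in cache."""
    return dependent_fetches(num_leaves, branching_factor) * miss_latency_ns


LEAVES = 1_000_000   # hypothetical scene size
MISS_NS = 300        # hypothetical miss latency in ns, not a measured value

for width in (4, 8):
    steps = dependent_fetches(LEAVES, width)
    total = chase_latency_ns(LEAVES, width, MISS_NS)
    print(f"{width}-wide nodes: {steps} dependent fetches, ~{total} ns per ray")
```

With these made-up inputs the 8-wide tree needs 7 dependent fetches instead of 10, but every fetch that still misses costs the full latency, which is why bigger and faster caches matter more here than raw compute.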
 