Discussion RDNA4 + CDNA3 Architectures Thread

Page 7

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,235
136





With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
Usually AMD takes around three quarters to land support in LLVM and amdgpu. Since RDNA2, though, they have pushed support for new devices in a much shorter window to prevent leaks.
But judging by the flurry of code in LLVM, there are a lot of commits. Maybe that is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like Frontier's).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.



Previous thread on CDNA2 and RDNA3 here

 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
Is there anything preventing them from using dual GCDs with slightly lower, greener clocks to bruteforce their way to 4090+ levels of pixel pushing power? It's only 22% slower in raster than a 4090 on average (TPU).
Do you know how much performance a second GCD would provide at the same clocks?

If it provided only 50%, then at a 2GHz clock speed it would be only ~20% faster than a single N31. It would perform the same in raster as the RTX 4090 but lose in RT, and who knows what the TBP of such a chip would be; I don't think it would be lower than 450W.

48GB of VRAM would also be serious overkill, and the production cost would be almost double that of N31, while the price won't be doubled.
Then Nvidia will release the full AD102 with 15-20% higher performance and this dual GCD will be pointless.

Honestly, the second GCD would have to provide 75% extra performance so that, with a 20% reduction in clock speed, it could be on par with the full AD102 in raster.
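The scaling estimates in this post can be sketched as a quick calculation. It assumes that raster performance scales linearly with clock speed and that a single N31 runs at about 2.5GHz; the function name and parameters are made up for illustration.

```python
def dual_gcd_speedup(second_gcd_gain, clock_ghz, base_clock_ghz=2.5):
    """Relative performance vs. a single N31 running at base_clock_ghz.

    second_gcd_gain: extra performance from the second GCD at equal clocks
                     (0.5 means +50%).
    Assumes performance scales linearly with clock, which is a simplification.
    """
    return (1.0 + second_gcd_gain) * (clock_ghz / base_clock_ghz)

# +50% from the second GCD, clocks dropped to 2.0 GHz
print(f"{dual_gcd_speedup(0.50, 2.0):.2f}x")  # 1.20x
# +75% from the second GCD, clocks reduced by 20% (2.5 -> 2.0 GHz)
print(f"{dual_gcd_speedup(0.75, 2.0):.2f}x")  # 1.40x
```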
 

Timorous

Golden Member
Oct 27, 2008
1,727
3,152
136
According to TPU, GTX 580 was 16% faster than GTX 480. It had 10% higher clocks, but max OC was comparable, the rest of the performance was from more shaders 512 vs 480. Power consumption dropped by 30W or 12%.
In case of fixed N31 we are supposedly talking about ~3.5GHz clocks at an unknown TBP.
RX 7900XTX needs 406W just to work at 2545MHz in Cyberpunk 4K Ultra + RT. TPU
Fixing N31 looks a lot harder to me.

So a 31% perf/watt increase. That would exceed 4090 performance at 355W.
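For reference, the 31% follows from the GTX 480 to GTX 580 figures quoted above (16% faster, 12% less power), and the 4090 comparison uses the "22% slower" TPU number mentioned earlier in the thread. A quick sketch with made-up function names, assuming the same perf/watt gain were applied to N31 at an unchanged 355W:

```python
def perf_per_watt_gain(perf_gain, power_change):
    """Relative perf/watt improvement from a performance and a power delta.

    perf_gain=0.16 means 16% faster; power_change=-0.12 means 12% less power.
    """
    return (1.0 + perf_gain) / (1.0 + power_change) - 1.0

gain = perf_per_watt_gain(0.16, -0.12)
print(f"GTX 480 -> GTX 580 perf/watt: {gain:+.1%}")  # +31.8%

# If the 7900 XTX is 22% slower than the 4090, the 4090 is 1 / 0.78 of its performance.
rtx4090_vs_xtx = 1.0 / (1.0 - 0.22)
print(f"4090 vs 7900 XTX: {rtx4090_vs_xtx:.2f}x")    # 1.28x
print(f"Fixed N31 at the same power: {1.0 + gain:.2f}x")  # 1.32x
```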
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
Maybe not 48GB but they could concoct something with 32GB VRAM.
For 32GB of VRAM you need only 8 MCDs. N31 uses 6 MCDs. 2x N31 GCDs would mean 12 MCDs. I think they could put fewer of them on the package.
128MB IC and 33% higher BW should be enough for 40% higher performance.
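A rough sketch of that MCD arithmetic, assuming each MCD keeps the N31 layout of 16MB of Infinity Cache, a 64-bit GDDR6 interface, and 4GB of attached VRAM (24GB across 6 MCDs). The constant and function names are invented for illustration.

```python
# Per-MCD figures as on N31 (assumption: unchanged for a hypothetical dual-GCD part).
MCD_CACHE_MB = 16  # Infinity Cache per MCD
MCD_BUS_BITS = 64  # GDDR6 interface width per MCD
MCD_VRAM_GB = 4    # 24 GB / 6 MCDs on the 7900 XTX

def mcd_config(mcd_count):
    """Return (Infinity Cache MB, bus width bits, VRAM GB) for a given MCD count."""
    return (mcd_count * MCD_CACHE_MB,
            mcd_count * MCD_BUS_BITS,
            mcd_count * MCD_VRAM_GB)

for count in (6, 8, 12):
    cache_mb, bus_bits, vram_gb = mcd_config(count)
    # At equal memory speed, bandwidth scales with bus width relative to N31's 384-bit bus.
    bw_gain = bus_bits / mcd_config(6)[1] - 1
    print(f"{count} MCDs: {cache_mb} MB IC, {bus_bits}-bit, {vram_gb} GB, BW {bw_gain:+.1%}")
```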
So a 31% perf/watt increase. That would exceed 4090 performance at 355W.
Not sure how you calculated 31% better perf/W, but that doesn't really matter.
What matters is that absolute performance would also improve by the same amount because of much higher clocks, and that's what's unrealistic to achieve from the same chip and process. GTX 580 also didn't achieve such a feat.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,262
5,259
136
According to TPU, GTX 580 was 16% faster than GTX 480. It had 10% higher clocks, but max OC was comparable, the rest of the performance was from more shaders 512 vs 480. Power consumption dropped by 30W or 12%.
In case of fixed N31 we are supposedly talking about ~3.5GHz clocks at an unknown TBP.
RX 7900XTX needs 406W just to work at 2545MHz in Cyberpunk 4K Ultra + RT. TPU
Fixing N31 looks a lot harder to me.

MUCH cheaper to respin a die back in those days.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
What would I like to see from RDNA4? Something like this:

Old MCD -> 16MB + 64-bit GDDR6
New MCD -> 24MB + 64-bit GDDR7
GDDR7 would be 3GB (24Gbit) modules.
20 CUs per shader engine.

Card | Chip | Clockspeed | Shader engines | CU | TMU | ROPs | Infinity Cache | Memory | Controller width | BW | Vram
RX 7600 | N33 | 2655 MHz | 2 | 32 | 128 | 64 | 32 MB | 18 Gbps | 128-bit | 288 GB/s | 8 GB
RX 8600XT | N43 | 3540 MHz (+33.3%) | 2 | 40 (+25%) | 160 (+25%) | 80 (+25%) | 48 MB (+50%) | 30 Gbps (+66.7%) | 128-bit | 480 GB/s (+66.7%) | 12 GB (+50%)
RX 7800XT | N32 | 2600 MHz | 3 | 60 | 240 | 96? | 64 MB | 18 Gbps? | 256-bit | 576 GB/s | 16 GB
RX 8800XT | N42 | 3250 MHz (+25%) | 4 | 80 (+33.3%) | 320 (+33.3%) | 128 (+33.3%) | 96 MB (+50%) | 30 Gbps (+66.7%) | 256-bit | 960 GB/s (+66.7%) | 24 GB (+50%)
RX 7900XTX | N31 | 2500 MHz | 6 | 96 | 384 | 192 | 96 MB | 20 Gbps | 384-bit | 960 GB/s | 24 GB
RX 8900XTX | N41 | 3330 MHz (+33.3%) | 6 | 120 (+25%) | 480 (+25%) | 240 (+25%) | 144 MB (+50%) | 32 Gbps (+60%) | 384-bit | 1536 GB/s (+60%) | 36 GB (+50%)
I tried to have the same increase in TFLOPs, texel and pixel fillrate across the stack; that's why the 8600XT has such a high clock.
I am not sure if it's possible for an AMD GPU to have 80 ROPs unless it has 2 shader engines with 5 render backends each, or with 4 render backends capable of processing 10 32-bit pixels per cycle.
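The "same increase across the stack" claim can be checked with a short script. This is only a sketch using the clock, CU, memory speed and bus width values from the wishlist table; the dictionary and function names are made up, and it assumes TMUs and ROPs grow by the same factor as CUs, as they do in the table.

```python
# name: (clock MHz, CUs, memory Gbps, bus width bits), values taken from the wishlist table
current = {"N33": (2655, 32, 18, 128), "N32": (2600, 60, 18, 256), "N31": (2500, 96, 20, 384)}
wished = {"N43": (3540, 40, 30, 128), "N42": (3250, 80, 30, 256), "N41": (3330, 120, 32, 384)}

def throughput_uplift(old, new):
    """Ratio of (CUs x clock). TFLOPs scale with this product, and texel/pixel
    fillrate follow because TMUs and ROPs grow by the same factor as CUs here."""
    return (new[0] * new[1]) / (old[0] * old[1])

def bandwidth_gbs(spec):
    """Memory bandwidth in GB/s = bus width (bits) x data rate (Gbps) / 8."""
    return spec[3] * spec[2] / 8

for (old_name, old), (new_name, new) in zip(current.items(), wished.items()):
    print(f"{old_name} -> {new_name}: {throughput_uplift(old, new):.3f}x throughput, "
          f"{bandwidth_gbs(old):.0f} -> {bandwidth_gbs(new):.0f} GB/s")
```

All three pairs land at roughly 1.67x, which is why the smallest chip needs the highest clock.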


 
Reactions: Tlh97 and Joe NYC

GodisanAtheist

Diamond Member
Nov 16, 2006
7,062
7,487
136
I think AMD leans more toward wider/cheaper, rather than more narrow/expensive, and with the separate MCDs it makes even more sense. Wider also makes it easier to have more memory.

I think AMD just does whatever.

R600 was wide bus normal RAM.

With R700 and Evergreen AMD went with narrow BUS fast ram.

With GCN Hawaii, AMD went fat ass BUS normal RAM.

With Fiji and Vega AMD threw all that crap out the window and went with ultra massive BUS + weird exotic memory.

Then we went with narrow and middling memory + Infinity Cache wild card with RDNA2.

Who knows what the hell AMD is going to do with RDNA4. They have clearly avoided GDDR6X memory, but they've never ignored mainline GDDR revisions, and sometimes they have gone completely off the reservation when it comes to RAM and bus sizes.
 

SteinFG

Senior member
Dec 29, 2021
521
610
106
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is a 1.25x core count increase, and if AMD can actually clock it high this time, maybe we'll see 1.5x over RDNA3? Eh, not sure.

Also, 60 WGPs means 6 SEs, so if N42 follows the design principle of N32, it'll have 36 WGPs (edit: maybe 42 in the rare scenario).

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post so I can check later how wrong/right he was.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,331
2,942
106
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

I don't think I will be able to contain myself waiting 2 more years for a Navi 31 replica.

This will be too much excitement for me to handle.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is
I think he just saw my post above about RDNA4 + GDDR7.

N42 having 72 or 84 CUs would mean >20 CUs per SE. RDNA2/3 didn't have more than 20 CUs per SE. I find 4 SEs with 20 CUs each more believable.
If the design was unchanged, then N41 would have to have 8 SEs (2x N42) with 16 CUs each for a total of 128 CUs, or 33% more than N31. I don't think they would go for 8 SEs with 20 CUs each in a single GCD; that would already be a 67% increase over N31 without any clock speed increase.
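The CU counts above are simple products of shader engines and CUs per engine. A small sketch, assuming N31's 6 SEs x 16 CUs as the baseline; the names are invented for illustration.

```python
N31_CUS = 6 * 16  # N31: 6 shader engines x 16 CUs = 96 CUs

def total_cus(shader_engines, cus_per_se):
    """Total CU count for a given shader engine layout."""
    return shader_engines * cus_per_se

# (4, 20) is the N42 guess; (8, 16) and (8, 20) are the two N41 options discussed.
for se, per_se in [(4, 20), (8, 16), (8, 20)]:
    cus = total_cus(se, per_se)
    print(f"{se} SE x {per_se} CU = {cus} CUs ({cus / N31_CUS - 1:+.1%} vs N31)")
```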
 

Timorous

Golden Member
Oct 27, 2008
1,727
3,152
136
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

I don't see 2x the TFLOPs like was claimed. That would need 4GHz with 7680 shaders.

I do see >1.5x the FPS though. A 3GHz N31 is around 17% faster than stock; add in 25% more shaders and GDDR7, and I think 1.5x raster is probably the minimum that AMD would be expecting.

Given RDNA3 is below par AMD might expect even more than that.
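The 4GHz figure can be reproduced with a quick calculation. This sketch assumes the usual peak-FP32 counting for RDNA3 (an FMA as 2 FLOPs, doubled again for dual issue), takes N31 as 6144 shaders at 2.5GHz, and treats performance as scaling perfectly with shader count in the raster estimate; the function names are made up.

```python
def fp32_tflops(shaders, clock_ghz):
    """Peak FP32 TFLOPs, counting an FMA as 2 FLOPs and RDNA3 dual issue as 2x."""
    return shaders * 2 * 2 * clock_ghz / 1000

def clock_for_tflops(target_tflops, shaders):
    """Clock (GHz) needed to reach target_tflops with the given shader count."""
    return target_tflops * 1000 / (shaders * 2 * 2)

n31 = fp32_tflops(6144, 2.5)              # N31: 96 CUs x 64 shaders at 2.5 GHz
needed = clock_for_tflops(2 * n31, 7680)  # 60 WGPs = 7680 shaders
print(f"N31: {n31:.1f} TFLOPs, 2x needs {needed:.2f} GHz")

# Raster estimate: 3 GHz N31 (~1.17x stock) with 25% more shaders, before any GDDR7 gain.
raster = 1.17 * 1.25
print(f"Raster before GDDR7: {raster:.2f}x")
```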
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
I see 2 big problems with the GPU generations to come.
GDDR7 will be a gigantic help for RTX 80** and RX 8**0 series.
The question is what to use after that? Samsung announced 32gbps GDDR7 and last year talked about speeds up to 36gbps.
This won't be enough for RTX 90** and RX 9**0 series unless they widen controller width.

The next problem will be the power consumption.
Currently, we are at 450W with the RTX 4090. If AMD hadn't botched RDNA3, the full AD102 would have been >500W.
Next gen will be what? 600W? The gen after 750W?
They can decrease clocks to make it more efficient, but then we will see only a mediocre increase in performance on desktop, not to mention laptop chips, which are limited to 175W.
 

SteinFG

Senior member
Dec 29, 2021
521
610
106
I see 2 big problems with the GPU generations to come.
GDDR7 will be a gigantic help for RTX 80** and RX 8**0 series.
The question is what to use after that? Samsung announced 32gbps GDDR7 and last year talked about speeds up to 36gbps.
This won't be enough for RTX 90** and RX 9**0 series unless they widen controller width.

The next problem will be the power consumption.
Currently, we are at 450W with the RTX 4090. If AMD hadn't botched RDNA3, the full AD102 would have been >500W.
Next gen will be what? 600W? The gen after 750W?
They can decrease clocks to make it more efficient, but then we will see only a mediocre increase in performance on desktop, not to mention laptop chips, which are limited to 175W.
They'll just increase effective bandwidth.
I think AMD and NVidia calculate it like that:
(LLC bandwidth * LLC Hit rate) + VRAM speed
Hmm... if this is true, then it's good news that the N41 GPU is gonna catch up to the 4090, and maybe even the 4090 Ti.
It needs to catch up to 5090 lol
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
They'll just increase effective bandwidth.
I think AMD and NVidia calculate it like that:
(LLC bandwidth * LLC Hit rate) + VRAM speed
I think that effective BW is just marketing BS.


Averaging it is just nonsense: either you have the data in cache or you don't.
If you have it, then you have the maximum BW: ~1940 GB/s in the case of RDNA2 N21 or ~5300 GB/s (corrected from ~4470) for RDNA3 N31.
If you don't, then it leaves you with only GDDR6/7 BW.
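For reference, a minimal sketch of what the quoted formula yields next to the two extremes described in this post. The ~5300 GB/s Infinity Cache and 960 GB/s VRAM figures are the N31 numbers from this thread; the hit rates are arbitrary placeholders, not measured values, and the function name is made up.

```python
def effective_bandwidth(llc_bw_gbs, hit_rate, vram_bw_gbs):
    """Quoted formula: (LLC bandwidth * LLC hit rate) + VRAM bandwidth, in GB/s."""
    return llc_bw_gbs * hit_rate + vram_bw_gbs

N31_IC_BW = 5300   # GB/s, Infinity Cache peak mentioned in this post
N31_VRAM_BW = 960  # GB/s, 384-bit GDDR6 at 20 Gbps

# Hypothetical hit rates, for illustration only.
for hit_rate in (0.25, 0.50, 0.75):
    bw = effective_bandwidth(N31_IC_BW, hit_rate, N31_VRAM_BW)
    print(f"hit rate {hit_rate:.0%}: {bw:.0f} GB/s 'effective'")

# The two cases argued for here: a hit sees the cache, a miss sees only VRAM.
print(f"cache hit: {N31_IC_BW} GB/s, cache miss: {N31_VRAM_BW} GB/s")
```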

P.S. I wonder how much IC would be needed for 90% hitrate at 4K. 1GB?
I think 4 stacks of Hynix HBM3E with 4TB/s(1TB/s per Stack) and 64-96GB Vram(16-24GB per Stack) could end up cheaper to make.

Edit: It looks like IC size matters for total BW.
I think what Locuza wrote as the theoretical Infinity Cache BW is wrong for N23/24. N22 is 0.75 of N21 if we exclude hitrate, but N23 and N24 are not 1/4 and 1/8 of N21.
Then 1GB IC BW excluding hitrate would be 10.67x higher than what N31 has?
 
Reactions: Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,235
136
Memory Bandwidth shouldn't be a problem for AMD (or NV) next year. 32GT/s GDDR7 on a 384 bit interface is a lot. Even if we assume the first gen GDDR7 to be unable to hit the PR value of 32 GT/s, at a lower 28 GT/s, that is an incredible 1344 GB/s. By next year GDDR6 should be able to hit ~1000 GB/s on a 384bit interface @22 GT/s.
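The bandwidth figures above follow from bus width times per-pin data rate. A small sketch of that arithmetic, with a function name chosen here for illustration:

```python
def gddr_bandwidth_gbs(bus_width_bits, rate_gtps):
    """Peak memory bandwidth in GB/s: bus width (bits) x data rate (GT/s) / 8 bits per byte."""
    return bus_width_bits * rate_gtps / 8

for label, rate in [("GDDR7 @ 32 GT/s", 32), ("GDDR7 @ 28 GT/s", 28), ("GDDR6 @ 22 GT/s", 22)]:
    print(f"384-bit {label}: {gddr_bandwidth_gbs(384, rate):.0f} GB/s")
```

On a 384-bit interface this gives 1536, 1344 and 1056 GB/s respectively, matching the 1344 GB/s and ~1000 GB/s figures in the post.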

The issue they would need to address is latency and power with regard to memory access. MALL is the solution they have to address the latency issue besides BW amplification. It remains to be seen how effective it is going to be: are they going to increase the size, perform prefetching, use a new algorithm for eviction/retention, improve the interconnect, etc.?

Right now they need to address many other problems besides memory BW: improving hit rates for L0 and L1 (they may need to bump up the cache sizes and VGPRs once again); improving the usefulness of the dual-issue ops, otherwise the theoretical TF are meaningless if dual issue gets engaged 5% of the time in a real gaming load; addressing the issues which came with the removal of the legacy geometry pipeline; and fixing the OREO/ROOE, which seems to need a big bunch of workarounds in mesa.

RDNA3 was an ambitious project, but it seems they failed to deliver. It could clock high, but power consumption is a mess. Removal of the legacy geometry pipeline and OREO are also quite ambitious. Compute was a big uplift.
RDNA3 almost doubled the MTr/CU. They need to use it more effectively. Scaling to more SEs does not seem to be a problem; power likely was the problem.

Even if they do nothing architecturally and just fix whatever problem they had, improve the power efficiency and clocks, move to a new node, and scale it up by 1.33x, that by itself should bring the 1.5x uplift before anything else.
 

udaemonia

Junior Member
Jul 23, 2023
2
0
6
I don't think I will be able to contain myself waiting 2 more years for a Navi 31 replica.

This will be too much excitement for me to handle.
Rick Bergman and David Wang stated that cost was a major factor in avoiding direct competition with the 4090. They're going to commit to the design and improve everywhere they can.

 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
If AMD were ambitious, they would have created a cryo-cooler to run RDNA3 at 3.5 GHz and could have sold it for $1999.
Even if we ignored the ridiculous power consumption, at a 3.5GHz clock speed it would be faster than the RTX 4090 but slower than the full AD102.
RT would be slower than the RTX 4090 but faster than the RTX 4080.
They could not ask $1999 for it.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
Rick Bergman and David Wang stated that cost was a major factor in avoiding direct competition with the 4090. They're going to commit to the design and improve everywhere they can.

This interview is nonsense. AMD botched RDNA3, which resulted in much lower clocks and performance than intended, so they had to sell it for less.
If it could clock higher at an acceptable TBP, they would never have sold it for only $999.
 

udaemonia

Junior Member
Jul 23, 2023
2
0
6
This interview is nonsense. AMD botched RDNA3 which resulted in much lower clocks along with performance than intended, so they had to sell It for less.
If It could clock higher at acceptable TBP, then they would never sell It for only $999.
They incurred massive development costs for the new chiplet design, so why not minimize expenses and transfer onto future designs, as they improve it? Plus they need hand-optimization to take advantage of the dual-issue throughput; it's an efficient design that can improve.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
They incurred massive development costs for the new chiplet design, so why not minimize expenses and transfer onto future designs, as they improve it?
They need to sell GPUs first, and at a high profit if possible.

By making a ~402mm² GCD, it should allow 8 SE, 160 CU, 10240 shaders, 640 TMU, 256 ROPs while costing ~$130 vs ~$102 for the whole chip including 6 MCDs.
Performance would be a lot better even with reduced clocks to keep power in check, so they could ask a lot more than $999 for this, which would easily cover the higher production cost.

Plus they need hand-optimization to take advantage of the dual issue throughput, its an efficient design that can improve.
It's a stupid design, not efficient. Nvidia with Ampere also added more FP32 units, which resulted in 25-30% higher performance, which you got from the beginning. This dual-issue is a total flop in comparison.
 