Discussion RDNA4 + CDNA3 Architectures Thread

DisEnchantment · Mar 23, 2022

With the GFX940 patches in full swing since the first week of March, it looks like MI300 is not far off!
AMD usually takes around three quarters to get support into LLVM and amdgpu. Since RDNA2, the window in which they push support for new devices has been much shorter, to prevent leaks.
But judging by the flurry of code in LLVM, this is a lot of commits. Maybe it is because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up like the one Frontier had).

See here for the GFX940 specific commits

History for llvm/lib/Target/AMDGPU - llvm/llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. - History for llvm/lib/Target/AMDGPU - llvm/llvm-project

github.com

Or Phoronix

More AMD "GFX940" Enablement Work Landing In LLVM - Phoronix

www.phoronix.com

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem that no host CPU capable of PCIe 5 would be available in the very near future, so it might have been pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it.

This is nuts, MI100/200/300 cadence is impressive.

Previous thread on CDNA2 and RDNA3 here

Question - Speculation: RDNA3 + CDNA2 Architectures Thread

Man I have been dying to make this one for a while now. First rumours for RDNA3 are here so new thread time! Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3...

forums.anandtech.com

TESKATLIPOKA · Jul 19, 2023

igor_kavinski said:
Is there anything preventing them from using dual GCDs with slightly lower, greener clocks to bruteforce their way to 4090+ levels of pixel pushing power? It's only 22% slower in raster than a 4090 on average (TPU).

Do you know how much performance a second GCD would provide at the same clocks?

If it provided only 50%, then at a 2GHz clockspeed it would be only ~20% faster than a single N31. It would perform the same in raster as the RTX 4090 but lose in RT, and who knows what the TBP for such a chip would be; I don't think it would be lower than 450W.

48GB Vram would also be serious overkill, and the production cost would be almost double that of N31, while the price wouldn't be doubled.
Then Nvidia would release the full AD102 with 15-20% higher performance and this dual GCD would be pointless.

Honestly, the second GCD would have to provide 75% extra performance; then, with a 20% reduction in clockspeed, it could be on par with the full AD102 in raster.
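
The scaling arithmetic above can be sketched roughly like this. It assumes performance scales linearly with clock (optimistic) and takes ~2.5GHz as the single-N31 baseline, which is an assumption inferred from the "20% reduction" to 2GHz; `dual_gcd_speedup` is a made-up helper name.

```python
def dual_gcd_speedup(second_gcd_gain, clock_ghz, base_clock_ghz=2.5):
    """Relative performance of a dual-GCD part vs. a single N31.

    second_gcd_gain: extra performance from the second GCD at equal
    clocks (0.5 means +50%).
    Assumes performance scales linearly with clock; base_clock_ghz is
    an assumed N31 baseline, not a measured figure.
    """
    return (1.0 + second_gcd_gain) * (clock_ghz / base_clock_ghz)


# +50% from the second GCD at 2GHz
print(round(dual_gcd_speedup(0.50, 2.0), 2))
# +75% from the second GCD at 2GHz
print(round(dual_gcd_speedup(0.75, 2.0), 2))
```

With these inputs the first case lands at about 1.2x a single N31 and the second at about 1.4x.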

Timorous · Jul 19, 2023

TESKATLIPOKA said:
According to TPU, GTX 580 was 16% faster than GTX 480. It had 10% higher clocks, but max OC was comparable, the rest of the performance was from more shaders 512 vs 480. Power consumption dropped by 30W or 12%.
In case of fixed N31 we are supposedly talking about ~3.5GHz clocks at an unknown TBP.
RX 7900XTX needs 406W just to run at 2545MHz in Cyberpunk 4K Ultra + RT (TPU).
Fixing N31 looks a lot harder to me.

So a 31% perf/watt increase. That would exceed 4090 performance at 355W.
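
For reference, the 31% figure presumably comes from dividing relative performance by relative power for the GTX 480 to GTX 580 numbers quoted above. A minimal sketch, with a hypothetical helper name:

```python
def perf_per_watt_gain(perf_gain, power_drop):
    """Perf/W improvement given a performance gain and a power reduction.

    perf_gain: 0.16 means 16% faster.
    power_drop: 0.12 means 12% lower power draw.
    """
    return (1.0 + perf_gain) / (1.0 - power_drop) - 1.0


# GTX 580 vs GTX 480: 16% faster at 12% lower power
gain = perf_per_watt_gain(0.16, 0.12)
print(f"{gain:.1%}")
```

This gives a bit under 32%, which matches the ~31% quoted here.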

igor_kavinski · Jul 19, 2023

TESKATLIPOKA said:
48GB Vram would also be serious overkill, and the production cost would be almost double that of N31, while the price wouldn't be doubled.

Maybe not 48GB but they could concoct something with 32GB VRAM.

TESKATLIPOKA · Jul 19, 2023

igor_kavinski said:
Maybe not 48GB but they could concoct something with 32GB VRAM.

For 32GB Vram you need only 8 MCDs. N31 uses 6 MCDs. 2x N31 GCDs would mean 12 MCDs, but I think they could put fewer of them on the package.
128MB IC and 33% higher BW should be enough for 40% higher performance.
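
A quick sketch of how those MCD numbers add up. The 16MB and 64-bit per-MCD figures are the current RDNA3 MCD values; the 4GB per MCD is an assumption derived from N31's 24GB over 6 MCDs, and 20 gbps GDDR6 is assumed for both configurations. `mcd_config` is a made-up name.

```python
MCD_CACHE_MB = 16   # Infinity Cache per MCD (current RDNA3 MCD)
MCD_BUS_BITS = 64   # GDDR6 interface width per MCD
MCD_VRAM_GB = 4     # assumption: 24GB / 6 MCDs on N31


def mcd_config(n_mcd, gbps=20):
    """Totals for a package with n_mcd MCDs at the given per-pin data rate."""
    bus_bits = n_mcd * MCD_BUS_BITS
    return {
        "vram_gb": n_mcd * MCD_VRAM_GB,
        "cache_mb": n_mcd * MCD_CACHE_MB,
        "bus_bits": bus_bits,
        "bandwidth_gbs": bus_bits * gbps / 8,
    }


n31 = mcd_config(6)
eight = mcd_config(8)
print(eight)
print(round(eight["bandwidth_gbs"] / n31["bandwidth_gbs"] - 1, 3))
```

Eight MCDs come out at 32GB, 128MB IC and a 512-bit bus, i.e. about a third more bandwidth than N31 at the same memory speed.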

Timorous said:
So a 31% perf/watt increase. That would exceed 4090 performance at 355W.

Not sure how you calculated 31% better perf/W, but that doesn't really matter.
What matters is that absolute performance would also improve by the same amount because of much higher clocks, and that's what's unrealistic to achieve from the same chip and process. GTX 580 also didn't achieve such a feat.

Heartbreaker · Jul 19, 2023

TESKATLIPOKA said:
According to TPU, GTX 580 was 16% faster than GTX 480. It had 10% higher clocks, but max OC was comparable, the rest of the performance was from more shaders 512 vs 480. Power consumption dropped by 30W or 12%.
In case of fixed N31 we are supposedly talking about ~3.5GHz clocks at an unknown TBP.
RX 7900XTX needs 406W just to run at 2545MHz in Cyberpunk 4K Ultra + RT (TPU).
Fixing N31 looks a lot harder to me.

MUCH cheaper to respin a die back in those days.

Heartbreaker · Jul 19, 2023

TESKATLIPOKA said:
Do you know how much performance a second GCD would provide at the same clocks?

Not enough to matter, CF is dead, and doing this would still need CF.

TESKATLIPOKA · Jul 19, 2023

What would I like to see from RDNA4? Something like this:

Old MCD -> 16MB + 64-bit GDDR6
New MCD -> 24MB + 64-bit GDDR7
GDDR7 would use 3 GB (24 Gbit) modules.
20 CUs per shader engine.

Model	Chip	Clockspeed	Shader engines	CU	TMU	ROPs	Infinity Cache	Memory speed	Controller width	BW	Vram
RX 7600	N33	2655 MHz	2	32	128	64	32 MB	18 gbps	128-bit	288 GB/s	8 GB
RX 8600XT	N43	3540 MHz (+33.3%)	2	40 (+25%)	160 (+25%)	80 (+25%)	48 MB (+50%)	30 gbps (+66.7%)	128-bit	480 GB/s (+66.7%)	12 GB (+50%)
RX 7800XT	N32	2600 MHz	3	60	240	96?	64 MB	18 gbps?	256-bit	576 GB/s	16 GB
RX 8800XT	N42	3250 MHz (+25%)	4	80 (+33.3%)	320 (+33.3%)	128 (+33.3%)	96 MB (+50%)	30 gbps (+66.7%)	256-bit	960 GB/s (+66.7%)	24 GB (+50%)
RX 7900XTX	N31	2500 MHz	6	96	384	192	96 MB	20 gbps	384-bit	960 GB/s	24 GB
RX 8900XTX	N41	3330 MHz (+33.3%)	6	120 (+25%)	480 (+25%)	240 (+25%)	144 MB (+50%)	32 gbps (+60%)	384-bit	1536 GB/s (+60%)	36 GB (+50%)

I tried to keep the same increase in TFLOPs, texel and pixel fillrate across the stack; that's why the 8600XT has such a high clock.
I am not sure if it's possible for an AMD GPU to have 80 ROPs, unless it's 2 shader engines with 5 render backends each, or with 4 render backends each capable of processing 10 32-bit pixels per cycle.
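
To illustrate the "same increase across the stack" idea, here is a rough check using the clock and CU numbers from the table above. It assumes TFLOPs and texel rate scale with clock x CU count (unchanged per-CU configuration); the names are made up for the sketch.

```python
# (clock in MHz, CU count) for the old and the proposed new chip,
# taken from the table above
stack = {
    "N33 -> N43": ((2655, 32), (3540, 40)),
    "N32 -> N42": ((2600, 60), (3250, 80)),
    "N31 -> N41": ((2500, 96), (3330, 120)),
}


def throughput_uplift(old, new):
    """Ratio of (clock * CUs) between the new and the old chip."""
    (old_clk, old_cu), (new_clk, new_cu) = old, new
    return (new_clk * new_cu) / (old_clk * old_cu)


for name, (old, new) in stack.items():
    print(name, round(throughput_uplift(old, new), 3))
```

All three pairs come out at roughly 1.67x. The ROP counts in the table grow by the same percentage as the CUs, so pixel fillrate follows the same ratio.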


GodisanAtheist · Jul 19, 2023

Heartbreaker said:
I think AMD leans more toward wider/cheaper, rather than more narrow/expensive, and with the separate MCDs it makes even more sense. Wider also makes it easier to have more memory.

I think AMD just does whatever.

R600 was wide bus normal RAM.

With R700 and Evergreen AMD went with narrow BUS fast ram.

With GCN Hawaii, AMD went fat ass BUS normal RAM.

With Fiji and Vega AMD threw all that crap out the window and went with ultra massive BUS + weird exotic memory.

Then we went with narrow and middling memory + Infinity Cache wild card with RDNA2.

Who knows what the hell AMD is going to do with RDNA4. They have clearly avoided GDDR6x memory, but they've never ignored mainline GDDR revisions and sometimes have completely left the reservation when it comes to RAM and bus sizes.

SteinFG · Jul 22, 2023

RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

Joe NYC · Jul 22, 2023

SteinFG said:
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

I don't think I will be able to contain myself waiting 2 more years for a Navi 31 replica.

This will be too much excitement for me to handle.

TESKATLIPOKA · Jul 23, 2023

SteinFG said:
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

I think he just saw my post above about RDNA4 + GDDR7.

N42 having 72 or 84 CUs would mean >20 CUs per SE. RDNA2/3 didn't have more than 20 CUs per SE. I believe more in 4 SEs with 20 CUs per SE.
If the design was unchanged, then N41 would have to have 8 SEs (2x N42) with 16 CUs each for a total of 128 CUs, or 33% more than N31. I don't think they would go for 8 SEs with 20 CUs each in a single GCD; that would already be a 67% increase compared to N31 without any clockspeed increase.
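
The CU counting above, as a rough sketch. It assumes 2 CUs per WGP, N31 at 6 SEs x 16 CUs (96 CUs), and that an N42 built like N32 would keep 3 shader engines; the helper names are hypothetical.

```python
CUS_PER_WGP = 2  # each RDNA WGP contains two CUs


def total_cus(shader_engines, cus_per_se):
    return shader_engines * cus_per_se


def uplift_vs(cus, baseline_cus):
    return cus / baseline_cus - 1.0


N31_CUS = total_cus(6, 16)  # 96 CUs

# N42 at 36 or 42 WGPs, assuming 3 shader engines like N32
for wgps in (36, 42):
    cus = wgps * CUS_PER_WGP
    print(cus, "CUs ->", cus / 3, "CUs per SE")

# The two 8-SE options for N41
for se, cu in [(8, 16), (8, 20)]:
    cus = total_cus(se, cu)
    print(f"{se} SE x {cu} CU = {cus} CUs ({uplift_vs(cus, N31_CUS):+.0%} vs N31)")
```

So 36 or 42 WGPs over 3 SEs means 24 or 28 CUs per SE, and the 8-SE options are +33% and +67% over N31.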

Timorous · Jul 23, 2023

SteinFG said:
RGT made a new video about RDNA4, says that it will continue to use a single GCD. Also says that the N41 has 60WGPs

60 WGPs is 1.25x core count increase, and if AMD, this time, can actually clock it high, maybe we'll see 1.5x over RDNA3? eh, not sure.

Also, 60WGPs means 6SEs, so, if N42 would follow the design principle of N32, it'll have 36 WGPs, edit: maybe 42 in the rare scenario.

But really, I have low confidence in RGT leaks. So take it as a guess at most. Writing this post to check it out later how wrong/right he is

I don't see 2x the TFLOPs as was claimed. That would need 4GHz with 7680 shaders.
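
Roughly how the 4GHz figure falls out, as a sketch. It assumes N31 at 6144 shaders and 2.5GHz, and counts 4 FLOPs per shader per clock (FMA x dual issue), which is how RDNA3 peak TFLOPs are usually quoted; the function names are made up.

```python
FLOPS_PER_SHADER_PER_CLOCK = 4  # FMA (2 FLOPs) x dual issue (2)


def fp32_tflops(shaders, clock_ghz):
    """Peak FP32 TFLOPs for a given shader count and clock."""
    return shaders * FLOPS_PER_SHADER_PER_CLOCK * clock_ghz / 1000.0


def clock_for_tflops(target_tflops, shaders):
    """Clock in GHz needed to reach target_tflops with the given shader count."""
    return target_tflops * 1000.0 / (shaders * FLOPS_PER_SHADER_PER_CLOCK)


n31 = fp32_tflops(6144, 2.5)
print(round(n31, 2))
print(round(clock_for_tflops(2 * n31, 7680), 2))
```

With 60 WGPs (7680 shaders), doubling N31's ~61 TFLOPs needs about 4.0GHz.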

I do see > 1.5x the FPS though. A 3GHz N31 is around 17% faster than stock; add in 25% more shaders and GDDR7, and I think 1.5x raster is probably the minimum that AMD would be expecting.

Given RDNA3 is below par AMD might expect even more than that.

TESKATLIPOKA · Jul 23, 2023

I see 2 big problems with the GPU generations to come.
GDDR7 will be a gigantic help for RTX 80** and RX 8**0 series.
The question is what to use after that? Samsung announced 32gbps GDDR7 and last year talked about speeds up to 36gbps.
This won't be enough for RTX 90** and RX 9**0 series unless they widen controller width.

The next problem will be the power consumption.
Currently, we are at 450W with the RTX 4090; if AMD hadn't botched RDNA3, then the full AD102 would be >500W.
Next gen will be what? 600W? The gen after that, 750W?
They can decrease clocks to make it more efficient, but then we will see only a mediocre performance increase on desktop, not to mention laptop chips, which are limited to 175W.

PJVol · Jul 23, 2023

Hmm... if this is true, then it's good news that the N41 GPU is gonna catch up to the 4090, and maybe even the 4090 Ti.

SteinFG · Jul 23, 2023


TESKATLIPOKA said:
I see 2 big problems with the GPU generations to come.
GDDR7 will be a gigantic help for RTX 80** and RX 8**0 series.
The question is what to use after that? Samsung announced 32gbps GDDR7 and last year talked about speeds up to 36gbps.
This won't be enough for RTX 90** and RX 9**0 series unless they widen controller width.

The next problem will be the power consumption.
Currently, we are at 450W with the RTX 4090; if AMD hadn't botched RDNA3, then the full AD102 would be >500W.
Next gen will be what? 600W? The gen after that, 750W?
They can decrease clocks to make it more efficient, but then we will see only a mediocre performance increase on desktop, not to mention laptop chips, which are limited to 175W.

They'll just increase effective bandwidth.
I think AMD and NVidia calculate it like that:
(LLC bandwidth * LLC Hit rate) + VRAM speed
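
Taken literally, that formula looks like the sketch below. The inputs are N31-like figures (about 5300 GB/s peak Infinity Cache bandwidth, 960 GB/s GDDR6), and the 50% hit rate is only a placeholder, not a measured value; the function name is made up.

```python
def effective_bandwidth_gbs(llc_bw_gbs, llc_hit_rate, vram_bw_gbs):
    """(LLC bandwidth * LLC hit rate) + VRAM speed, exactly as written above."""
    if not 0.0 <= llc_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return llc_bw_gbs * llc_hit_rate + vram_bw_gbs


# N31-like inputs with a placeholder 50% hit rate
print(effective_bandwidth_gbs(5300, 0.5, 960))
```

The result moves a lot with the hit rate, so the number is only as good as the hit rate assumed for a given resolution.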

PJVol said:
Hmm... if this is true, then it's good news that the N41 GPU is gonna catch up to the 4090, and maybe even the 4090 Ti.

It needs to catch up to 5090 lol

TESKATLIPOKA · Jul 23, 2023

SteinFG said:
They'll just increase effective bandwidth.
I think AMD and NVidia calculate it like that:
(LLC bandwidth * LLC Hit rate) + VRAM speed

I think that effective BW is just marketing BS.

Averaging it is just nonsense: either you have the data in cache or you don't.
If you have it, then you have the maximum ~1940 GB/s BW in the case of RDNA2 N21, or ~5300 GB/s for RDNA3 N31.
If you don't, then you are left with only the GDDR6/7 BW.

P.S. I wonder how much IC would be needed for 90% hitrate at 4K. 1GB?
I think 4 stacks of Hynix HBM3E with 4TB/s(1TB/s per Stack) and 64-96GB Vram(16-24GB per Stack) could end up cheaper to make.

Edit: It looks like IC size matters for total BW.
I think what Locuza wrote as the theoretical Infinity Cache BW is wrong for N23/24. N22 is 0.75 of N21 if we exclude hitrate, but N23 and N24 are not 1/4 and 1/8 of N21.
Then 1GB IC BW excluding hitrate would be 10.67x higher than what N31 has?

DisEnchantment · Jul 23, 2023

Memory Bandwidth shouldn't be a problem for AMD (or NV) next year. 32GT/s GDDR7 on a 384 bit interface is a lot. Even if we assume the first gen GDDR7 to be unable to hit the PR value of 32 GT/s, at a lower 28 GT/s, that is an incredible 1344 GB/s. By next year GDDR6 should be able to hit ~1000 GB/s on a 384bit interface @22 GT/s.
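
The bandwidth figures above follow from data rate x bus width / 8. A minimal sketch, with a made-up function name:

```python
def gddr_bandwidth_gbs(data_rate_gtps, bus_width_bits):
    """Peak memory bandwidth in GB/s from per-pin data rate (GT/s) and bus width (bits)."""
    return data_rate_gtps * bus_width_bits / 8


# 384-bit interface at GDDR6 22 GT/s, GDDR7 28 GT/s and GDDR7 32 GT/s
for rate in (22, 28, 32):
    print(rate, "GT/s ->", gddr_bandwidth_gbs(rate, 384), "GB/s")
```

That gives 1056, 1344 and 1536 GB/s respectively on a 384-bit bus.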

The issue they would need to address is latency and power with regard to memory access. MALL is the solution they have for the latency issue, besides BW amplification. It remains to be seen how effective it is going to be: are they gonna increase the size, perform prefetching, use a new algo for eviction/retention, improve the interconnect, etc.

Right now they need to address many other problems besides the memory BW: improving hit rates for L0 and L1 (they may need to bump up the cache sizes and VGPRs once again); improving the usefulness of the dual-issue ops, otherwise the theoretical TF are meaningless if dual issue gets engaged 5% of the time in a real gaming load; addressing the issues which came with the removal of the legacy geometry pipeline; and fixing the OREO/ROOE, which seems to need a big bunch of workarounds in mesa.

RDNA3 was an ambitious project, but it seems they failed to deliver. It could clock high, but power consumption is a mess. Removal of the legacy geometry pipeline and OREO are also quite ambitious. Compute was a big uplift.
RDNA3 almost doubled the MTr/CU. They need to use it more effectively. Scaling to more SEs does not seem to be a problem; power likely was the problem.

Even if they do nothing architecturally and just fix whatever problem they had, improve the power efficiency and clocks, move to a new node, and scale it up by 1.33x, that by itself should bring the 1.5x uplift before anything else.

PJVol · Jul 23, 2023

SteinFG said:
It needs to catch up to 5090 lol

Oh... really? Didn't know that

igor_kavinski · Jul 23, 2023

If AMD were ambitious, they would have created a cryo-cooler to run RDNA3 at 3.5 GHz and could have sold it for $1999.

udaemonia · Jul 23, 2023

Joe NYC said:
I don't think I will be able to contain myself waiting 2 more years for a Navi 31 replica.

This will be too much excitement for me to handle.

Rick Bergman and David Wang stated that cost was a major factor in avoiding direct competition with the 4090. They're going to commit to the design and improve everywhere they can.

AMD: We Didn't Make RDNA 3 as Fast as RTX 4090 to Keep Costs and Power Down | Hardware Times

In an interview with Japanese outlet ITMedia, AMD Executives Rick Bergman and David Wang shared Team Radeon’s opinions on the state of the next-gen GPU market. As noted in our review of the Radeon RX 7900 XT, the RDNA 3 heavyweight is substantially slower than the GeForce RTX 4080 in ray-traced...

www.hardwaretimes.com

TESKATLIPOKA · Jul 23, 2023

igor_kavinski said:
If AMD were ambitious, they would have created a cryo-cooler to run RDNA3 at 3.5 GHz and could have sold it for $1999.

Even if we ignored the ridiculous power consumption, at a 3.5GHz clockspeed it would be faster than the RTX 4090 but slower than the full AD102.
RT would be slower than the RTX 4090 but faster than the RTX 4080.
They could not ask $1999 for it.

TESKATLIPOKA · Jul 23, 2023

udaemonia said:
Rick Bergman and David Wang stated that cost was a major factor in avoiding direct competition with the 4090. They're going to commit to the design and improve everywhere they can.

AMD: We Didn't Make RDNA 3 as Fast as RTX 4090 to Keep Costs and Power Down | Hardware Times

In an interview with Japanese outlet ITMedia, AMD Executives Rick Bergman and David Wang shared Team Radeon’s opinions on the state of the next-gen GPU market. As noted in our review of the Radeon RX 7900 XT, the RDNA 3 heavyweight is substantially slower than the GeForce RTX 4080 in ray-traced...

www.hardwaretimes.com

This interview is nonsense. AMD botched RDNA3, which resulted in much lower clocks and performance than intended, so they had to sell it for less.
If it could clock higher at an acceptable TBP, they would never have sold it for only $999.

udaemonia · Jul 23, 2023

TESKATLIPOKA said:
This interview is nonsense. AMD botched RDNA3, which resulted in much lower clocks and performance than intended, so they had to sell it for less.
If it could clock higher at an acceptable TBP, they would never have sold it for only $999.

They incurred massive development costs for the new chiplet design, so why not minimize expenses and carry the design over to future products as they improve it? Plus, they need hand-optimization to take advantage of the dual-issue throughput; it's an efficient design that can improve.

igor_kavinski · Jul 23, 2023

TESKATLIPOKA said:
They could not ask $1999 for it.

Introduce at that price. Once the AMD loyalists all have one, they can reduce the price to $1599 or less.

TESKATLIPOKA · Jul 23, 2023

udaemonia said:
They incurred massive development costs for the new chiplet design, so why not minimize expenses and carry the design over to future products as they improve it?

They need to sell GPUs first, and make a high profit on them if possible.

Making a ~402mm2 GCD should allow 8 SE, 160 CU, 10240 shaders, 640 TMU and 256 ROPs while costing ~$130 vs ~$102 for the whole chip including 6 MCDs.
Performance would be a lot better even with clocks reduced to keep power in check, so they could ask a lot more than $999 for it, which would easily cover the higher production cost.

Plus, they need hand-optimization to take advantage of the dual-issue throughput; it's an efficient design that can improve.

It's a stupid design, not an efficient one. Nvidia also added more FP32 units with Ampere, which resulted in 25-30% higher performance, and you got that from the beginning. This dual-issue is a total flop in comparison.
