Question Speculation: RDNA3 + CDNA2 Architectures Thread

uzzi38 · Jan 23, 2021

Man I have been dying to make this one for a while now.

First rumours for RDNA3 are here so new thread time!

Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3 is much bigger than from RDNA1 to RDNA2. We should expect many big improvements in GFX11. 🤔" / Twitter

TESKATLIPOKA · Jul 4, 2022

leoneazzurro said:
Yes, that is why I said that efficiency in term of real throughput / number of FP32 units will likely decrease. But total throughput per CU would be significantly higher.

What you wrote was a bit different.

But if this is really doubled, well... N31 would be a 24576 FP32 unit GPU, with clocks supposedly higher than RDNA2. That would be a 122TF@2500MHz card.
Granted, as these units would do also INT and other operations their efficiency would not improve -in that regard- respect to RDNA2.

Now back to RDNA3.
What AMD did by adding another FP32 vALU is kinda similar to Nvidia's Ampere.
RDNA3 is capable of 2x FP32(INT32) OPS per circle per SIMD32 compared to RDNA2. Ampere vs Turing has 2x more FP32 units, but 1/2 of these units are FP32+INT32.
Ampere gained 25-30% more performance per SM(streaming multiprocessor ) by doing this, so I expect 35-40% in RDNA3.
RDNA2 WGP vs RDNA3 WGP
2 CU per WGP vs 2 CU per WGP
2 SIMD32 per CU vs 4 SIMD32 per CU
2 vALU32 per CU vs 8 vALU32 per CU
In theory, 4x more FLOPs, but real CU performance should be ~2.7-2.8x.
Now I have to wonder If Phoenix(Point) APU has 6 WGPs or only 3 WGPs. I hope It's the former.

Aapje · Jul 4, 2022

TESKATLIPOKA said:
Ampere gained 25-30% more performance per SM(streaming multiprocessor ) by doing this, so I expect 35-40% in RDNA3.

That would explain where AMD is getting that rumored huge performance boost from that allows them to overtake Nvidia. Basically, AMD was already more efficient except for this change, that they missed out on. So if they copy this and Nvidia doesn't have a new trick, they get ahead.

Although Nvidia also seems to be copying the big cache/small bus setup of AMD, which may boost Nvidia again, at least on 1080p/1440p.

leoneazzurro · Jul 4, 2022

Yes, wanted to write that and forgot to mention it, then I was convinced to have given that meaning. Never write when you have other things to do at the same time, sigh.

TESKATLIPOKA · Jul 4, 2022

Aapje said:
That would explain where AMD is getting that rumored huge performance boost from that allows them to overtake Nvidia. Basically, AMD was already more efficient except for this change, that they missed out on.
.......

I always thought this was a big advantage of Ampere.
If we compared GPUs with comparable number of CUs and SMs, then RDNA2 needed to have 25-30% higher clocks to be on par with Ampere.
This of course meant a lot worse power consumption at too high clocks, RX 6700XT or 6500XT was a great example of this. Laptop RDNA2 had this problem too. Low number of CUs was compensated by high clocks. Not good for power efficiency.

I made a comparison for Phoenix (Point) IGP vs Rembrandt IGP.
3(6) WGP vs 6 WGP
6(12) CU vs 12 CU
24(48) SIMD32 vs 24 SIMD32
48(96) vALU32 vs 24 vALU32
1536(3072) FP32 vs 768 FP32
At 2.4GHz It would mean 7,373(14,746) GFLOPs vs 3,686 GFLOPs

At the same clockspeed and number of WGPs It could be ~170-180% faster than Rembrandt, with only half of WGPs It could be only ~35-40% faster.

TESKATLIPOKA · Jul 4, 2022

I also made a comparison of N21 vs N33. Just for fun.
To have comparable TFLOPs:
N21(40 WGP; 80 CU; 2250MHz) -> 80*64*2*2250 = 23 TFlops
vs
N33v1(12 WGP; 24 CU; 1872MHz) -> 24*256*2*1872 = 23 TFlops
or
N33v2(10 WGP; 20 CU; 2250MHz) -> 20*256*2*2250 = 23 TFlops
or
N33v3(8 WGP; 16 CU; 2810MHz) -> 16*256*2*2810 = 23 TFlops

To have comparable performance:
N33v1(12 WGP; 24 CU) -> 65-67 CU RDNA2 at similar clockspeeds. Needs to be clocked at 2687-2769 MHz to perform as N21.
or
N33v2(10 WGP; 20 CU) -> 54-56 CU RDNA2 at similar clockspeeds. Needs to be clocked at 3215-3330 MHz to perform as N21.
or
N33v3(8 WGP; 16 CU) -> 43-45 CU RDNA2 at similar clockspeeds. Needs to be clocked at 4000-4186 MHz to perform as N21.

leoneazzurro · Jul 4, 2022

Rumor was about N33 having 32WGP.
As N33 should have also half the memory bandwidth (and supposedly same amount of IC) than N21, it should need more TF for achieving the same performance.

DisEnchantment · Jul 4, 2022

Looks like someone push N31 UMC patch then recalled it immediately.

[PATCH] drm/amdgpu: add umc ras functions for navi31

Add umc ras functions for navi31:
1. Add driver and asic register for umc new ip.
2. Support query umc ras error counter.
3. Support ras umc ue error address remapping.

+/* number of umc channel instance with memory map register access */
+#define UMC_V8_10_CHANNEL_INSTANCE_NUM 2
+/* number of umc instance with memory map register access */
+#define UMC_V8_10_UMC_INSTANCE_NUM 2
+/* number of mcd instance with memory map register access */
+#define UMC_V8_10_MCD_INSTANCE_NUM 6

6 MCD * 2 UMC/MCD * 2 CH/UMC --> 384-bit bus (Each UMC is 32 bit wide and 2 Channel per UMC for GDDR6 and each MCD is 64-bit wide)

Hopefully not a jebait like this one for N21

#define UMC_V8_7_HBM_MEMORY_CHANNEL_WIDTH 128

TESKATLIPOKA · Jul 4, 2022

leoneazzurro said:
Rumor was about N33 having 32WGP.
As N33 should have also half the memory bandwidth (and supposedly same amount of IC) than N21, it should need more TF for achieving the same performance.

That's why I wrote comparisons for both TFLOPs and actual performance.

Who mentioned 32 WGPs? Bondrewd mentioned 30 WGPs per GCD(N31) and that N33's ALU count is identical to N21 Link. On the other hand, Greymon55 was talking about 4096 ALUs.
And here is a problem.
If Bondrewd counted the second vALU32, then It would be only 10 WGPs, but It would also need >3GHz clockspeed to perform the same. Doesn't sound very power efficient.
If he didn't then It would be 20 WGPs and that would be a lot faster than N21 at the same clocks.
N33(20 WGP; 40 CU; 2250MHz) -> 40*256*2*2250 = 46 TFlops
Performance could be 35-40% faster at the same clocks. That's a totally different performance class.

edit: He was at least correct about 8 SIMDs per WGP, If nothing else. 8GB Vram could be also correct. As you said, he is MIA for some time.

leoneazzurro · Jul 4, 2022

TESKATLIPOKA said:
That's why I wrote comparisons for both TFLOPs and actual performance.

Who mentioned 32 WGPs? Bondrewd mentioned 30 WGPs per GCD(N31) and that N33's ALU count is identical to N21 Link. On the other hand, Greymon55 was talking about 4096 ALUs.
And here is a problem.
If Bondrewd counted the second vALU32, then It would be only 10 WGPs, but It would also need >3GHz clockspeed to perform the same. Doesn't sound very power efficient.
If he didn't then It would be 20 WGPs and that would be a lot faster than N21 at the same clocks.
N33(20 WGP; 40 CU; 2250MHz) -> 40*256*2*2250 = 46 TFlops
Performance could be 40% faster at the same clocks. That's a totally different performance class.

I think it was mentioned by some leakers, and also N31 "lost" some CU on the way (from 15360 to 12288). Not only Greymon who I think has lost a lot of credibility to me, but also other people. Also Idk if Bondrewd's information were very early and not up to date, he doesn't write on B3D since a lot of time. But who knows, the level of secrecy about RDNA3 seems very tight.

Frenetic Pony · Jul 4, 2022

DisEnchantment said:
Looks like someone push N31 UMC patch then recalled it immediately.

[PATCH] drm/amdgpu: add umc ras functions for navi31

6 MCD * 2 UMC/MCD * 2 CH/UMC --> 384-bit bus (Each UMC is 32 bit wide and 2 Channel per UMC for GDDR6 and each MCD is 64-bit wide)

Hopefully not a jebait like this one for N21

What does UMC stand for? Universal multi core? Why would they have two different kinds of dies if the IO is on each die? All I can think of is that media engines are on UMCs and could be configured independently of graphics, meaning low-ish graphics with a lot of media engines for a hypothetical "professional" config for media types.

Where are you getting the channel widths? I don't see them in the patch notes you posted It'd make more sense if the UMCs were 192/256 bit PHYs, and the patch notes just indicate that 6 MCDs happen to go through the 2 UMCs when going out to main memory. I also wouldn't be surprised if that Navi 21 patch note wasn't some mind game, but just a hypothetical scenario they put in and never came to pass because they never had the market and/or engineering time for it.

1 MCD here could easily be around the same performance as a Navi 23 GPU (whatever that looks like in RDNA3 terms), which is a fine baseline standard to start stacking things up on. Stack 6 together and you get 2.4x the hardware Navi 21 has, and you'd probably get 450 watts without a problem with high enough clocks. Add whatever bandwidth to main memory you'd need with "UMCs" and there's your killer GPU, chopped up into roughly similar sized chiplets as Zen has, which I guess is good for reticle sizes/yields.

Karnak · Jul 4, 2022

UMC = Unified Memory Controller.

Saylick · Jul 4, 2022

The secrecy regarding RDNA 3 is indeed very tight... to summarize what we currently know or heard about RDNA 3, organized by legitimacy with my 2c in bold:
1) Info AMD have told us directly

2) Info from patches/patents

Redesigned CU, doubled FP32 SIMD per CU
New VOPD instructions to allow for dual issue wave32?
VLIW2 instructions?
384-bit memory bus split over 6 MCDs?
End-to-end data compression

3) Info from leakers/educated guesses/speculation

High clocking design (almost 3 GHz on larger dies, >3 GHz on smaller dies)
N31
- 48 WGP, 96 CUs, 12288 FP units
- ~75 FP32 TFLOPS of compute
- Single TSMC N5 GCD with die size around 400mm2
- 384-bit memory bus split over 6 MCDs using GDDR6 21 Mbps
- More Infinity Cache, each MCD with 64 MB cache for a total of 6*64 MB = 384 MB IC
N32
- 32 WGP, 64 CUs, 8192 FP units
- ~50 FP32 TFLOPS of compute
- Single TSMC N5 GCD (presumable smaller than N31, so ~280mm2?)
- 256-bit memory bus split over 4 MCDs using GDDR6 21 Mbps
- More Infinity Cache, each MCD with 64 MB cache for a total of 4*64 MB = 256 MB IC
N33
- 16 WGP, 32 CUs, 4096FP units
- ~25 FP32 TFLOPS of compute
- Single TSMC N6 die, monolithic (presumably 360mm2-400mm2?)
- 128-bit memory bus with GDDR6 21 Mbps
- 128 MB IC

Performance or power estimates are anyone's guess at this point. The performance uplift from RDNA 2 to RDNA 3 on a per-CU basis will fall between 1.0x and 2.0x, with the extremes of that range being extremely unlikely. MLID has suggested that the uplift (i.e. perf/TFLOP) is similar to what happened with Ampere, but it scales better. My interpretation of that: if Ampere doubled the FP units but only really performed like an equivalent Turing part 1.33x the FP, then RDNA 3 CU might double FP but only get the equivalent of an RDNA 2 part with 1.5x of the FP (note: 1.5x is my guess, it is not based on any evidence). Let's just run with 1.5x perf/CU uplift for now. If we assume that is true, then 96 CUs for Navi 31 should perform like 144 RDNA 2 CUs. After you factor in higher clocks (say 25% higher), then it's roughly equivalent to 180 RDNA 2 CUs, or about 2.25x faster than 6900XT which is not that far off from some Twitter leakers. If we assume similar scaling as Ampere (i.e. 1.33x) then you're looking at a simple 2x faster. I'll give AMD the benefit of the doubt with respect to their ">50% perf/W" claims just because it seemed to be true when they gave the same claim for RDNA 2. With 2.25x the performance uplift over the 6900XT (300W TDP), a 50% perf/W uplift implies a 450W TDP for N31. If we get something closer to 60% perf/W uplift, then we're closer to 420W. Either AMD are sandbagging that perf/W claim or AMD won't really have a perf/W advantage over Lovelace, assuming that the rumors of a 4090 being 2x 3090 at 450W TDP are true.

TESKATLIPOKA · Jul 5, 2022

Saylick said:
........
Performance or power estimates are anyone's guess at this point. The performance uplift from RDNA 2 to RDNA 3 on a per-CU basis will fall between 1.0x and 2.0x, with the extremes of that range being extremely unlikely. MLID has suggested that the uplift (i.e. perf/TFLOP) is similar to what happened with Ampere, but it scales better. My interpretation of that: if Ampere doubled the FP units but only really performed like an equivalent Turing part 1.33x the FP, then RDNA 3 CU might double FP but only get the equivalent of an RDNA 2 part with 1.5x of the FP (note: 1.5x is my guess, it is not based on any evidence). Let's just run with 1.5x perf/CU uplift for now.
.....

This doesn't really add up. Peak TFLOPs should increase 4x per CU, so only 50% increase in performance is too low.
I made a table from DisEnchantment's info.

	RDNA2	RDNA3	Difference
WGP	1	1	0%
CU / WGP	2	2	0%
SIMD32 / CU	2	4	+ 100%
vALU32 / SIMD32	1	2	+ 100%
Peak TFLOPs / CU	1	4	+ 300%
Performance / CU	1	2.7-3 ?	+ 170-200% ?

What you wrote about 1.5x increase per CU could be true, If the number of SIMD32 stayed the same and only vALU32 in a SIMD32 was doubled.
If I am mistaken, then please someone correct me. Thanks

Saylick · Jul 5, 2022

TESKATLIPOKA said:
This doesn't really add up. Peak TFLOPs should increase 4x per CU, so only 50% increase in performance is too low.
I made a table from DisEnchantment's info.

RDNA2 RDNA3 Difference
WGP 1 1 0%
CU / WGP 2 2 0%
SIMD32 / CU 2 4 + 100%
vALU32 / SIMD32 1 2 + 100%
Peak TFLOPs / CU 1 4 + 300%
Performance / CU 1 2.7-3 ? + 170-200% ?

What you wrote about 1.5x increase per CU could be true, If the number of SIMD32 stayed the same and only vALU32 in a SIMD32 was doubled.
If I am mistaken, then please someone correct me. Thanks

I think the table is mostly right... In RDNA 1 and 2, there's two banks of SIMD32 ALUs in each CU. In RDNA 3, that's been doubled to four banks. I think we're in agreement here.

What I'm not in agreement with is the ratio of vALU/SIMD32, which I think is 1:1. vALU32 is the same as SIMD32 if I'm not mistaken, i.e. 32-wide vector math is the same as SIMD32.

Peak TFLOPs increases by about 2.5x per CU, which comes from 2x ALUs + 25% more clocks. Then, as a whole, N31 has 20% more CUs than N21. In total, that's a increase of 3x in peak TFLOPs. Of course, performance of the CU won't scale linearly. 75 TFLOPs of RDNA 3 is roughly as performant as 56 TFLOPs of RDNA 2 (assuming 1.5x perf uplift per CU).

TESKATLIPOKA · Jul 5, 2022

Saylick said:
I think the table is mostly right... In RDNA 1 and 2, there's two banks of SIMD32 ALUs in each CU. In RDNA 3, that's been doubled to four banks. I think we're in agreement here.

What I'm not in agreement with is the ratio of vALU/SIMD32, which I think is 1:1. vALU32 is the same as SIMD32 if I'm not mistaken, i.e. 32-wide vector math is the same as SIMD32.

Peak TFLOPs increases by about 2.5x per CU, which comes from 2x ALUs + 25% more clocks. Then, as a whole, N31 has 20% more CUs than N21. In total, that's a increase of 3x in peak TFLOPs. Of course, performance of the CU won't scale linearly. 75 TFLOPs of RDNA 3 is roughly as performant as 56 TFLOPs of RDNA 2 (assuming 1.5x perf uplift per CU).

1. DisEnchantment mentioned a second vALU32 line in a SIMD.
2. Why would RDNA3 CU provide only 50% increase? From DisEnchantment's post It looks like in RDNA3 they combined 2 CUs into 1, at least that's what he thinks. Then there is a second vALU line in SIMD32 on top of that.
3. I would disregard any increase in frequency for now and compare only at ISO frequency. We don't know If 25% increase in clokspeed is correct or not.

edit: What I found is that RDNA2 SIMD32 contains 32 ALUs(shaders). So in my understanding RDNA3 SIMD32 should have 2x 32 ALUs (shaders), If there is a second vALU32 line. Per CU It would mean 256 ALUs and 512 ALUs per WGP.

edit: Hopefully, It's not just a jebait. ADA looks powerful. A lot higher clocks, FP32+INT32 is supposedly separate now and 144SM.

DisEnchantment · Jul 5, 2022

TESKATLIPOKA said:
1. DisEnchantment mentioned a second vALU32 line in a SIMD.
2. Why would RDNA3 CU provide only 50% increase? From DisEnchantment's post It looks like in RDNA3 they combined 2 CUs into 1, at least that's what he thinks. Then there is a second vALU line in SIMD32 on top of that.
3. I would disregard any increase in frequency for now and compare only at ISO frequency. We don't know If 25% increase in clokspeed is correct or not.

edit: What I found is that RDNA2 SIMD32 contains 32 ALUs(shaders). So in my understanding RDNA3 SIMD32 should have 2x 32 ALUs (shaders), If there is a second vALU32 line. Per CU It would mean 256 ALUs and 512 ALUs per WGP.

In case you are wondering further
RDNA2 CU has
2x SIMD32, inside each SIMD32 there is 32 Lane VALU, (or 32 ALUs ), 1x scalar ALU (which can produce multiple ops per cycle), 1x 2 lane DP VALU, 1x 8 lane Trans VALU

RDNA3 CU has
4x SIMD32 (assumed SIMD32 based on LLVM patches), Inside each SIMD32 there are 2x 32 lane VALUs (from Patents and LLVM Patches), 1x scalar, 1x 2lane DPFP VALU, not sure about trans VALU

However it could be a also be a big jebait, because it is using the value of the NUM_SIMD_PER_CU defined by Navi10, which is 2

https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c#L41

#include "navi10_enum.h"

NUM_SIMD_PER_CU = 2

#include "soc21_enum.h"

NUM_SIMD_PER_CU = 4

Not sure what is happening

leoneazzurro · Jul 5, 2022

I think an increase of the capability of the single CU/WGP is likely, as over a certain point increasing the number of WGPs/SE will get diminishing returns in terms of area used and will increase complexity of the compute die. We already have seen it work on Ampere, Ada will likely improve on that, so why should not it work on AMD's side. Moreover, if there is no BVH trasversal hardware added, the increase in compute capability can speed up this tray tracing task greatly.

Glo. · Jul 5, 2022

TESKATLIPOKA said:
This doesn't really add up. Peak TFLOPs should increase 4x per CU, so only 50% increase in performance is too low.
I made a table from DisEnchantment's info.

RDNA2 RDNA3 Difference
WGP 1 1 0%
CU / WGP 2 2 0%
SIMD32 / CU 2 4 + 100%
vALU32 / SIMD32 1 2 + 100%
Peak TFLOPs / CU 1 4 + 300%
Performance / CU 1 2.7-3 ? + 170-200% ?

What you wrote about 1.5x increase per CU could be true, If the number of SIMD32 stayed the same and only vALU32 in a SIMD32 was doubled.
If I am mistaken, then please someone correct me. Thanks

If AMD doubled resources for FP32 per CU then there is zero reason to believe that they will only achieve 50% of scalability.

It should be at least 80%, since the DATA POOL is in tact, per FP32 unit, per each SIMD, per each CU and per each WGP. In some cases, RDNA3 appears to have more resources per particular units than RDNA2.

There is no reason to believe that FP32 units achieved only 50%, when they are true ALUs, and not a software solution like Nvidia's Ampere FP32 execution on INT32 units.

TESKATLIPOKA · Jul 5, 2022

Glo. said:
If AMD doubled resources for FP32 per CU then there is zero reason to believe that they will only achieve 50% of scalability.

It should be at least 80%, since the DATA POOL is in tact, per FP32 unit, per each SIMD, per each CU and per each WGP. In some cases, RDNA3 appears to have more resources per particular units than RDNA2.

There is no reason to believe that FP32 units achieved only 50%, when they are true ALUs, and not a software solution like Nvidia's Ampere FP32 execution on INT32 units.

Are you talking about 2x SIMD32 per CU or the second vALU32 in SIMD32?
The best thing would be, If both of them are true for RDNA3.

edit: Ok, you most likely meant the second vALU32 in SIMD32, but then CU should be 4x FP32 per CU.

Glo. · Jul 5, 2022

TESKATLIPOKA said:
Are you talking about 2x SIMD32 per CU or the second vALU32 in SIMD32?
The best thing would be, If both of them are true for RDNA3.

edit: Ok, you most likely meant the second vALU32 in SIMD32, but then CU should be 4x FP32 per CU.

What I meant was caches, etc per ALU/CU/WGP.

Those are the resources that feed the pipelines. So far I have not seen anything that would created a bottleneck for throughput that would limit scalability of doubled ALU count per WGP/CU to just 50%.

Heck, even 80% appears to be... pessimistic.

Glo. · Jul 5, 2022

TESKATLIPOKA said:
Now I have to wonder If Phoenix(Point) APU has 6 WGPs or only 3 WGPs. I hope It's the former.

Phoenix has 6 WGPs, 1536 ALUs.

So technically, if there is Infinity Cache - we are looking at at least RTX 2060 desktop performance.

TESKATLIPOKA · Jul 5, 2022

Glo. said:
Phoenix has 6 WGPs, 1536 ALUs.

So technically, if there is Infinity Cache - we are looking at at least RTX 2060 desktop performance.

6 WGPs(12 CUs) with 4x SIMD32 per CU and 2x vALU32 per SIMD32 would be 3072 ALUs or 512 ALUs per WGP. Just check out the last 2-3 pages in this thread.
The question is, what is true.
Does RDNA3 have 2x more SIMD32 per CU or does RDNA3 have 2x more vALU32 per SIMD32, or both are true and theoretical FP32 thoughtput is 4x better.

Aapje · Jul 5, 2022

TESKATLIPOKA said:
or both are true and theoretical FP32 thoughtput is 4x better.

That seems a little optimistic.

Saylick · Jul 5, 2022

TESKATLIPOKA said:
6 WGPs(12 CUs) with 4x SIMD32 per CU and 2x vALU32 per SIMD32 would be 3072 ALUs or 512 ALUs per WGP. Just check out the last 2-3 pages in this thread.
The question is, what is true.
Does RDNA3 have 2x more SIMD32 per CU or does RDNA3 have 2x more vALU32 per SIMD32, or both are true and theoretical FP32 thoughtput is 4x better.

Yeah, I think it's one or the other, not both.

Aapje said:
That seems a little optimistic.

This.

Glo. said:
What I meant was caches, etc per ALU/CU/WGP.

Those are the resources that feed the pipelines. So far I have not seen anything that would created a bottleneck for throughput that would limit scalability of doubled ALU count per WGP/CU to just 50%.

Heck, even 80% appears to be... pessimistic.

I hope you're right, at least for the sake of competition.

DisEnchantment · Jul 5, 2022

TESKATLIPOKA said:
6 WGPs(12 CUs) with 4x SIMD32 per CU and 2x vALU32 per SIMD32 would be 3072 ALUs or 512 ALUs per WGP. Just check out the last 2-3 pages in this thread.
The question is, what is true.
Does RDNA3 have 2x more SIMD32 per CU or does RDNA3 have 2x more vALU32 per SIMD32, or both are true and theoretical FP32 thoughtput is 4x better.

I think 2x 32 lane VALU per SIMD32 is quite certain, people don't play with LLVM. It is incredibly hard to optimize it correctly, touching something else is just asking for trouble. It could produce some random code. Which actually happened for Blender, AMD's HIP compiler produces code crashing the compute kernel, it was fixed some months ago.

Regarding 4x SIMD32 per CU, I believe it is possible. I think the enum value of NUM_SIMD_PER_CU = 4 is actually correct.
The gfx_v11.c code is just copied from Navi10. So it is not updated yet in so many places. So they may indeed use the actual SIMD count from soc21, fingers crossed but no reason to think otherwise
Golden registers are not there, RAS is not there, firmware is not there and not uploaded in LVFS yet. Still long way to go.
I think they are just doing final bringup and have not got everything done. After that they will push the rest of the code.
Just a reminder, driver for every new ASIC is get developed initially on emulator as usual.
Once the new merge Window opens up with Linux 5.20, expect another burst of changes. Couple of weeks from now.

Another thing I have noticed is that N31 has no XGMI, no support added uptil now. N21 used up around a good chunk of mm2 for XGMI alone and only used in Radeon Pro W6800X for Apple.

Regarding MCD, I think 32MiB is just too small, like 35mm2 per MCD including UMC+PHY. Quite weird to have so small chiplets, Packaging overhead could be a thing,
It should be at least 64MiB each, it would take the MCD to around 60-65mm2 in size. Otherwise might just make a single MCD of 64 MiB instead of 2x 32 MiB chiplets
Also would make N32 carry more cache than N21 (i.e. 256 MiB and not 128 MiB if it were 32 MiB per MCD) considering it is so much more potent and needs the BW.

Question Speculation: RDNA3 + CDNA2 Architectures Thread

Platinum Member

Platinum Member

Golden Member

Golden Member

Platinum Member

Platinum Member

Golden Member

Golden Member

Platinum Member

Golden Member

Senior member

Senior member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Golden Member

Golden Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Golden Member

Diamond Member

Golden Member