Question Speculation: RDNA3 + CDNA2 Architectures Thread


uzzi38

Platinum Member
Oct 16, 2019
2,702
6,405
146

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
If Angstronomics is correct, whatever AMD did to double the number of shaders was very cheap in terms of silicon, as the Navi 33 die is smaller than Navi 23 yet has double the number of shaders.

Combined with the very minor transistor density improvement of TSMC 6nm, I think this architecture is going to be like Ampere, where compute goes up but gaming performance is going to be quite modest relative to the compute increase.

However power consumption will not increase that much and it will be pointless for the most part as I think AMD has also not increased the pipeline that much to accommodate this wider architecture.

I think Navi 33 is going to be 1.2 to 1.3x a Navi 23, but at 120 watts instead of 165 watts.
You are underestimating N33 a bit too much.
Just having clocks >3GHz would mean >20% better performance.
Ampere gained ~30% better performance by moving from 64 FP32 + 64 INT32 to 64 FP32 + 64 INT32/FP32.
RDNA3 should be moving from 64 FP32/INT32 to 128 FP32/INT32.
Unlike Ampere there are 2x more shaders, so I expect higher gains than what Ampere got; I think 50% shouldn't be too unreasonable.
120W is too low, but it should be under 160-170W or something like that according to Bondrewd.
 

Glo.

Diamond Member
Apr 25, 2015
5,761
4,666
136
You are underestimating N33 a bit too much.
Just having clocks >3GHz would mean >20% better performance.
Ampere gained ~30% better performance by moving from 64 FP32 + 64 INT32 to 64 FP32 + 64 INT32/FP32.
RDNA3 should be moving from 64 FP32/INT32 to 128 FP32/INT32.
Unlike Ampere there are 2x more shaders, so I expect higher gains than what Ampere got; I think 50% shouldn't be too unreasonable.
120W is too low, but it should be under 160-170W or something like that according to Bondrewd.
The gain should be nearly linear.

At least for N31 and 32. N33 and PHX should get 80% scaling from increased amount of FP32 per CU.
 
Reactions: Leeea

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
The gain should be nearly linear.

At least for N31 and 32. N33 and PHX should get 80% scaling from increased amount of FP32 per CU.
RDNA3 WGP is slightly smaller than RDNA2 WGP on the same process node according to Angstronomics, so having 80-95% increase in performance sounds too unrealistic.
Rumored 2X higher performance for N31 compared to N21 would also be too low, considering 48WGPs would be equivalent to >90 WGPs RDNA2 and N21 has 40WGPs. Then you also have a lot higher clockspeed.
 

lixlax

Member
Nov 6, 2014
184
158
116
You are underestimating N33 a bit too much.
Just having clocks >3GHz would mean >20% better performance.
Ampere gained ~30% better performance by moving from 64 FP32 + 64 INT32 to 64 FP32 + 64 INT32/FP32.
RDNA3 should be moving from 64 FP32/INT32 to 128 FP32/INT32.
Unlike Ampere there are 2x more shaders, so I expect higher gains than what Ampere got; I think 50% shouldn't be too unreasonable.
120W is too low, but it should be under 160-170W or something like that according to Bondrewd.
If AMD is actually doing what Nvidia did with Ampere (aka shared fp/int units) then:
100% (5120 SPs in N21) x 1.2 (total unit count increase) x 1.3 (gain that Nvidia got) x 1.3 (clock speed increase, no idea what the actual clocks are going to be) = 202.8%.
Very close to the 2x increase over Navi 21 rumoured lately.
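
For anyone who wants to play with the guesses, here is a minimal sketch of the multiplication above. The factor names are just labels made up for illustration, and the numbers are the ones from the post (the clock factor in particular is a guess, as stated there).

```python
def compound_scaling(factors):
    """Multiply independent scaling factors into one overall multiplier."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


# Factors as given in the post; the key names are illustrative only.
factors = {
    "unit_count": 1.2,  # total unit count increase over N21
    "shared_fp_int": 1.3,  # gain Nvidia got with Ampere
    "clock": 1.3,  # guessed clock speed increase
}

overall = compound_scaling(factors.values())
print(f"{overall:.3f}x, i.e. {overall * 100:.1f}% of N21")
```

Swapping in a different clock or per-unit gain shows how sensitive the 2x figure is to each guess.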
 
Reactions: Leeea

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
If AMD is actually doing what Nvidia did with Ampere (aka shared fp/int units) then:
100% (5120 SPs in N21) x 1.2 (total unit count increase) x 1.3 (gain that Nvidia got) x 1.3 (clock speed increase, no idea what the actual clocks are going to be) = 202.8%.
Very close to the 2x increase over Navi 21 rumoured lately.
AMD is not doing that; RDNA has shared FP32/INT32 from the beginning. Turing had FP32 and INT32 separate, then Ampere added FP32 functionality to the INT32 cores.
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
AMD is not doing that; RDNA has shared FP32/INT32 from the beginning. Turing had FP32 and INT32 separate, then Ampere added FP32 functionality to the INT32 cores.
Are we 100% positive that is the case?

Chips and Cheese drew up a diagram for VOPD instructions and it looks like only one of the SIMD banks can do FP+INT:


Now, I understand that this could just be a limitation of the VOPD instruction. It is possible that both SIMD banks are FP+INT and if you wanted to run INT+INT instructions, it would simply take 2 instructions. Meanwhile, if you wanted to run FP+FP or FP+INT, it would just take a single VOPD instruction.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
Are we 100% positive that is the case?

Chips and Cheese drew up a diagram for VOPD instructions and it looks like only one of the SIMD banks can do FP+INT:
View attachment 69859

Now, I understand that this could just be a limitation of the VOPD instruction. It is possible that both SIMD banks are FP+INT and if you wanted to run INT+INT instructions, it would simply take 2 instructions. Meanwhile, if you wanted to run FP+FP or FP+INT, it would just take a single VOPD instruction.
I assumed that they will keep the functionality, but it's quite possible I am wrong.
If they changed it, that would explain the small size of the CU (WGP), and you don't really need the ratio to be 1:1, as Nvidia explained during Ampere's launch.
 
Reactions: Leeea

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,235
136
Are we 100% positive that is the case?

Chips and Cheese drew up a diagram for VOPD instructions and it looks like only one of the SIMD banks can do FP+INT:
View attachment 69859

Now, I understand that this could just be a limitation of the VOPD instruction. It is possible that both SIMD banks are FP+INT and if you wanted to run INT+INT instructions, it would simply take 2 instructions. Meanwhile, if you wanted to run FP+FP or FP+INT, it would just take a single VOPD instruction.
RDNA3 diagram is incorrect. As per LLVM and RadeonSI/Mesa
GFX1100/1 has 1536 entries = 192K. This has 6 VGPR banks which means dual issue per cycle
GFX1103 has 1024 entries = 128K. This has 4 VGPR banks which means opportunistically dual issue when compiler can reorder the ops
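
As a quick sanity check on the entry counts above, here is a small sketch of how 1536 and 1024 entries map to 192K and 128K. It assumes each entry spans 32 lanes (wave32) of 32-bit registers; those two constants are my assumption, not something stated in the post.

```python
LANES_PER_ENTRY = 32  # assumption: one VGPR entry covers a full wave32
BYTES_PER_LANE = 4  # assumption: 32-bit registers


def vgpr_file_kib(entries):
    """Size of a VGPR file in KiB for a given number of entries."""
    return entries * LANES_PER_ENTRY * BYTES_PER_LANE // 1024


for target, entries in {"GFX1100/1": 1536, "GFX1103": 1024}.items():
    print(f"{target}: {entries} entries = {vgpr_file_kib(entries)} KiB")
```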
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
RDNA3 diagram is incorrect. As per LLVM
GFX1100/1 has 1536 entries = 192K. This has 6 VGPR banks which means dual issue per cycle
GFX1103 has 1024 entries = 128K. This has 4 VGPR banks which means opportunistically dual issue when compiler can reorder the ops
So that Chips and Cheese diagram corresponds to N33 and Phoenix, right?
 
Reactions: Leeea

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
RDNA3 diagram is incorrect. As per LLVM and RadeonSI/Mesa
GFX1100/1 has 1536 entries = 192K. This has 6 VGPR banks which means dual issue per cycle
GFX1103 has 1024 entries = 128K. This has 4 VGPR banks which means opportunistically dual issue when compiler can reorder the ops
Right, I was aware of that. I just used that one because it was the only diagram I could find that showed the different instructions that a VOPD instruction could issue.

Regarding the 6 register banks for N31/N32, I really do expect that AMD will be able to get closer to 2x the throughput since there's pretty much no risk of bank conflicts. The design change makes a lot of sense too if you consider that RDNA's front end was already capable of issuing up to 4 instructions per clock, so the front end was very robust. Underutilized if you will.
 

Glo.

Diamond Member
Apr 25, 2015
5,761
4,666
136
RDNA3 WGP is slightly smaller than RDNA2 WGP on the same process node according to Angstronomics, so having 80-95% increase in performance sounds too unrealistic.
Rumored 2X higher performance for N31 compared to N21 would also be too low, considering 48WGPs would be equivalent to >90 WGPs RDNA2 and N21 has 40WGPs. Then you also have a lot higher clockspeed.
OREO.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,235
136
Regarding the 6 register banks for N31/N32, I really do expect that AMD will be able to get closer to 2x the throughput since there's pretty much no risk of bank conflicts. The design change makes a lot of sense too if you consider that RDNA's front end was already capable of issuing up to 4 instructions per clock, so the front end was very robust. Underutilized if you will.
They could be bottlenecked by the vector L0. But I have seen that @Kepler_L2 already mentioned a doubled L0.
Fortunately for RDNA, the IFC/MALL also works very well and is going to be in its second generation.
To be seen how they address BW/Latency issues and shader occupancy, divergent memory access.

Several interesting things on N31/2 already in open source commits
  • NGG only geometry pipeline
  • OREO
  • Unified GFX ring/MES only schedule
  • 1 Cycle Wave64
  • WMMA
  • True 16bit ops/similar to mobile
  • Many new low precision vector ops, derived from CDNA2+/GFX940
  • CDNA2+ features like Architected flat scratch/packed work item ids
  • End to End DCC
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
They could be bottlenecked by the vector L0. But I have seen that @Kepler_L2 already mentioned a doubled L0.
Fortunately for RDNA, the IFC/MALL also works very well and is going to be in its second generation.
To be seen how they address BW/Latency issues and shader occupancy, divergent memory access.

Several interesting things on N31/2 already in open source commits
  • NGG only geometry pipeline
  • OREO
  • Unified GFX ring/MES only schedule
  • 1 Cycle Wave64
  • WMMA
  • True 16bit ops/similar to mobile
  • Many new low precision vector ops, derived from CDNA2+/GFX940
  • CDNA2+ features like Architected flat scratch/packed work item ids
Don't forget end-to-end data compression as well.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,605
136
Well we have another week (8 days in reality) to find out.
I think with all the changes we saw, 50% perf/W should be really sandbagging.
My personal feeling is: we'll see around +70% perf/W on the top SKU but at the same time there will be no 400+W SKU except for custom products. At 375W, that would mean exactly 2x the performance of the 6900XT.
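
A rough sketch of the arithmetic behind that estimate: relative performance is the perf/W gain times the power ratio. The 300 W figure for the reference 6900XT is an assumption added here (it is not given in the post), and the function name is only illustrative.

```python
def relative_performance(perf_per_watt_gain, new_power_w, old_power_w):
    """Performance ratio implied by a perf/W gain and a change in board power."""
    return perf_per_watt_gain * new_power_w / old_power_w


REF_6900XT_W = 300  # assumption: reference board power, not stated in the post

for gain in (1.5, 1.7):
    ratio = relative_performance(gain, 375, REF_6900XT_W)
    print(f"+{(gain - 1) * 100:.0f}% perf/W at 375W -> about {ratio:.2f}x")
```

Under that assumption, +70% perf/W at 375W lands a little above 2x, so it is in the same ballpark as the figure in the post.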
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,428
2,914
136
Well we have another week (8 days in reality) to find out.
I think with all the changes we saw, 50% perf/W should be really sandbagging.
My personal feeling is we'll see around +70% perf/W on the top SKU but at the same time there will not be any 400+W SKU except for custom products. At 375W, that would mean exactly 2x the performance of the 6900XT.
N33 is on 6nm, so this one should have the worst power efficiency of them all.
 
Reactions: Leeea

Kaluan

Senior member
Jan 4, 2022
503
1,074
106
Anyone else think Angstronomics' supposed N32 leak makes little sense?

Why would they suddenly change GPU hardware allocation from N31 to N32 (and funnily enough N33 goes back to N31-style as well)?

I would be quite shocked if N32 really is 30WGP and 2560 SIMD per Shader Engine, when N31 and N33 are both 2048 SIMD per SE.

Also why does everyone believe V-stacked RDNA3 won't use specialized SRAM libraries on the MCDs? Doesn't 16+32MB make more sense than 16+16MB? That's how ZenX3D does it at least. Why the departure?
 
Reactions: Tlh97 and Leeea

jpiniero

Lifer
Oct 1, 2010
14,835
5,452
136
Anyone else think Angstronomics' supposed N32 leak makes little sense?

Why would they suddenly change GPU hardware allocation from N31 to N32 (and funnily enough N33 goes back to N31-style as well)?

I would be quite shocked if N32 really is 30WGP and 2560 SIMD per Shader Engine, when N31 and N33 are both 2048 SIMD per SE.

???? N33 is the one that's different, not N32. N32's GCD is probably just a cut die of N31.
 
Reactions: Tlh97 and Leeea

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,605
136
???? N33 is the one that's different, not N32. N32's GCD is probably just a cut die of N31.

N33 is different in the WGP structure (e.g., less local memory), but N32 is supposedly different from N31 in the number of WGPs per shader engine, according to Skyjuice's leaks.

N31 -> 48 WGP in 6 SE -> 8 WGP per SE
N32 -> 30 WGP in 3 SE -> 10 WGP per SE

And, again according to Angstronomics, N32 GCD is a different die, 2/3 of the size of N31 GCD
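
To make the per-SE numbers concrete, here is a small sketch that reproduces them from the leaked WGP and SE counts. The 256 ALU lanes per WGP is an assumption (4 SIMDs x 32 lanes, counted twice for dual issue); it is chosen because it reproduces the 2048 and 2560 per-SE figures quoted earlier in the thread.

```python
ALU_LANES_PER_WGP = 256  # assumption: 4 SIMDs x 32 lanes x 2 (dual-issue counting)

# (WGPs, shader engines) per the leaks quoted in the post
configs = {"N31": (48, 6), "N32": (30, 3)}


def per_shader_engine(wgps, shader_engines):
    """Return (WGPs per SE, ALU lanes per SE) for a GPU configuration."""
    wgp_per_se = wgps // shader_engines
    return wgp_per_se, wgp_per_se * ALU_LANES_PER_WGP


for name, (wgps, ses) in configs.items():
    wgp_per_se, lanes_per_se = per_shader_engine(wgps, ses)
    print(f"{name}: {wgp_per_se} WGP per SE, {lanes_per_se} ALU lanes per SE")
```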
 