Question Speculation: RDNA3 + CDNA2 Architectures Thread


uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146

coercitiv

Diamond Member
Jan 24, 2014
6,400
12,849
136
That would make second hand 6700XT's at $300 a hard sell unless you need a card now...
I think people underestimate what's about to happen: between the economic earthquake and the crypto tsunami, video cards of all generations will be bouncing around like debris in a hurricane.

In case the metaphors weren't enough, there's a perfect storm coming.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
I think people underestimate what's about to happen: between the economic earthquake and the crypto tsunami, video cards of all generations will be bouncing around like debris in a hurricane.

In case the metaphors weren't enough, there's a perfect storm coming.
As someone who lives by the Hurricane belt, your metaphor reminds me of another one. The calm before the storm.
 
Reactions: Tlh97 and Kepler_L2

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
As someone who lives by the Hurricane belt, your metaphor reminds me of another one. The calm before the storm.

I think we may be in the eye of the storm. We just had crazy high prices and scarcity. After the eye passes over, everything is going to flip directions, and there will be a glut of cards and lower prices.

Or at least, that's what I hope happens.
 
Reactions: maddie

GodisanAtheist

Diamond Member
Nov 16, 2006
7,063
7,489
136
I think people underestimate what's about to happen: between the economic earthquake and the crypto tsunami, video cards of all generations will be bouncing around like debris in a hurricane.

In case the metaphors weren't enough, there's a perfect storm coming.

- My finger has been hovering above the buy button on second hand 6700XTs as they approach the $350 price point, which would still represent a doubling of performance for me... but now I'm thinking even $250 might be too much for one...

I think it makes sense to play the gap between the new generation's announcement and the actual launch of the cards (which given the hunger in the market for affordable cards will likely still be a huge crapshow in terms of supply meeting demand).

If the 7600XT comes in at ~$379 and provides 6900XT performance, it's going to put a real hard cap on how much RDNA2 can resell for across the stack.
 

beginner99

Diamond Member
Jun 2, 2009
5,223
1,598
136
I think people underestimate what's about to happen: between the economic earthquake and the crypto tsunami, video cards of all generations will be bouncing around like debris in a hurricane.

In case the metaphors weren't enough, there's a perfect storm coming.

Yeah, I think demand will fall sharply due to less money from inflation, then a recession and people losing jobs... At the same time, costs go up, so I can't really see an N33 for $400 given rising costs and the sinking value of the dollar.

Then they will have to stop raising rates, or even lower them again and start printing again. Once you start the cycle of printing and inflation there is no way out, really, until it all implodes in about 10 years and everyone runs to crypto.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
That's exactly why N33 won't be that cheap because they are going to continue to sell the RDNA 2 Refresh below it.
If N33 is an 8 GB card, as seems likely, do you think they will be selling 6700XT 12GB refresh cards below it?

Can they use 6GB instead, or do we get an NVIDIA scenario with lower cards having more memory than higher cards?

Will RDNA2 refresh cards offer the same performance & price as today, with no perf/$ improvement?
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
Seeing that N33 is on TSMC N6, which is a cheaper node than N7, and it has roughly the same performance as the 6900XT at a smaller die size (~400mm2 vs 520mm2) and with less RAM, I think it's fair to say that N33 will be positioned at a price-point much lower than the 6900XT (and 6800XT for that matter). It would not make sense for AMD to try to inflate the price of N33 just so that they can continue selling a more costly (for them) RDNA 2 refresh. If anything, it would be in AMD's best interest to phase out RDNA 2 entirely and replace it with N34 to flesh out the lower-end of the performance stack.
 

jpiniero

Lifer
Oct 1, 2010
14,841
5,456
136
Seeing that N33 is on TSMC N6, which is a cheaper node than N7

No reason to think N6 wafers are cheaper than N7. Cheaper per transistor, yeah, but the N33 die is probably bigger than N22. And of course they just released a product at $549 for the top model.

Doesn't look like AMD will do a new model any time soon below N33 because they are afraid of The Flood. But N33 shouldn't be affected since it should be faster than the 6900 XT.
 

Timorous

Golden Member
Oct 27, 2008
1,727
3,152
136
No reason to think N6 wafers are cheaper than N7. Cheaper per transistor, yeah, but the N33 die is probably bigger than N22. And of course they just released a product at $549 for the top model.

Doesn't look like AMD will do a new model any time soon below N33 because they are afraid of The Flood. But N33 shouldn't be affected since it should be faster than the 6900 XT.

They can do a cut N33 if need be, like the 5600XT was a cut N10.
 

Karnak

Senior member
Jan 5, 2017
399
767
136
Are we still on that "N33 is N6" train?

There were slides and there was an official roadmap. According to all of that (and even Wang mentioning this one while talking) RDNA3 is based on TSMC N5. And on top of that it's a "chiplet architecture".

So if N33 = RDNA3 it's a) not monolithic but rather based on their "chiplet architecture" and b) it's based on TSMC N5 and not N6.

Don't know why leakers keep saying it's monolithic and N6. Since the FAD we know that this is not the case.
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
Are we still on that "N33 is N6" train?

There were slides and there was an official roadmap. According to all of that (and even Wang mentioning this one while talking) RDNA3 is based on TSMC N5. And on top of that it's a "chiplet architecture".

So if N33 = RDNA3 it's a) not monolithic but rather based on their "chiplet architecture" and b) it's based on TSMC N5 and not N6.

Don't know why leakers keep saying it's monolithic and N6. Since the FAD we know that this is not the case.
I don't think David Wang calling RDNA 3 a "chiplet architecture" means that all forms of RDNA 3 will be chiplet based. I mean, we'll see RDNA 3 in monolithic designs, e.g. mobile APUs.
 
Reactions: Aapje and Kepler_L2

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
But Phoenix Point uses chiplet tech as well.
Sure, even if Phoenix Point is RDNA 3 and is a chiplet-based processor, that doesn't change the fact that RDNA 3 can be in a monolithic design. Zen 2 and Zen 3 are arguably "chiplet architectures" yet we obviously saw them in monolithic designs. For the same reason, maybe Phoenix Point uses RDNA 3 in a chiplet fashion, but come some point down the road we see a budget monolithic APU that uses RDNA 3 when RDNA 4 is coming out. Point is, there's nothing that requires RDNA 3 to be in a chiplet-based form.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
Are we still on that "N33 is N6" train?

There were slides and there was an official roadmap. According to all of that (and even Wang mentioning this one while talking) RDNA3 is based on TSMC N5. And on top of that it's a "chiplet architecture".

So if N33 = RDNA3 it's a) not monolithic but rather based on their "chiplet architecture" and b) it's based on TSMC N5 and not N6.

Don't know why leakers keep saying it's monolithic and N6. Since the FAD we know that this is not the case.

Really, that "leak", if it was one, from LinkedIn could just be that engineer only working on SRAM or IO chiplets, but not on the third RDNA3 compute chip. If (as an example) you work on N31, N32, and SRAM chiplets, then leave before N33 really gets going, how do you put that on your resume? Well, maybe you put down exactly what we saw.
 

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
Sure, even if Phoenix Point is RDNA 3 and is a chiplet-based processor, that doesn't change the fact that RDNA 3 can be in a monolithic design. Zen 2 and Zen 3 are arguably "chiplet architectures" yet we obviously saw them in monolithic designs. For the same reason, maybe Phoenix Point uses RDNA 3 in a chiplet fashion, but come some point down the road we see a budget monolithic APU that uses RDNA 3 when RDNA 4 is coming out. Point is, there's nothing that requires RDNA 3 to be in a chiplet-based form.
The question is whether we will see a monolithic mobile APU containing RDNA 3 anytime soon. The one we expected, Phoenix, was then announced to use chiplet tech (whatever that means in detail) at FAD, and other potential candidates aren't known yet.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,429
2,914
136
If Phoenix Point is not monolithic, then I wonder which chip contains what.
It will be interesting to see if we get the 6 WGP iGPU on Phoenix Point U or only on the higher-TDP models.
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,223
1,598
136
The question is whether we will see a monolithic mobile APU containing RDNA 3 anytime soon. The one we expected, Phoenix, was then announced to use chiplet tech (whatever that means in detail) at FAD, and other potential candidates aren't known yet.

Given that we know from leaks the Zen4 IOD will contain a very basic iGPU, I think there isn't much room for some half-baked solution in between. The Zen4 iGPU should cover a ton of cases: basic graphics and video decode. Most people don't need more than that, really.

APUs with chiplet tech can give the "i"GPU a lot of cache so it can work much better with the limited bandwidth. Ultimately you would simply not need a dGPU anymore in mobile devices.
 

Leeea

Diamond Member
Apr 3, 2020
3,698
5,432
136
I think people underestimate what's about to happen: between the economic earthquake and the crypto tsunami, video cards of all generations will be bouncing around like debris in a hurricane.

In case the metaphors weren't enough, there's a perfect storm coming.
I am hoping for that.

I have no desire to buy another current-gen GPU, but if next gen can double the performance of my current GPU, I am hoping to buy.

Thing is, I am hoping next gen gets released, prices dive a bit, and I can get that performance for $900 USD-ish. I am willing to wait several months after launch; my current GPU will be good enough for a while.

(yea, I know I am in fantasy land)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,243
136
Also found the commit indicating RDNA3 has 1/2 the DPFP throughput of RDNA2 (i.e. 1/16 in RDNA2 vs 1/32 in RDNA3), which could support the idea of 2x FP throughput per CU

How does this work? This makes no sense to me.
I found a more precise answer to your question. It seems that RDNA has a separate 2-lane DPFP VALU per SIMD alongside the 32-lane (hence SIMD32) VALU, thus the FP64:FP32 ratio is 2:32 (i.e. 1:16).
This 2-lane DPFP VALU can run in parallel with the 32-lane FP32 VALU.

On RDNA3, there is a second 32-lane VALU, and that is why they now have a 1:32 DPFP ratio (i.e. 2 : 2x32).

So indeed, AMD is going to project 2x FP32 throughput per SIMD32 in RDNA3.
So applications which rely a lot on vector f32_mul/add or f32_mul_mul/mul_add/add_add/add_mul + accumulate are going to get a great boost, e.g. rendering apps.
Games would benefit opportunistically when the kernel instructions can be reorganized so that the operand dependency order is such that 2x FP32 vector ops can be done per cycle per SIMD32, including FMA-type ops. The VGPR has some new swizzling modes to support the operand gather needed to feed the VOPD dual-issue ops.
Additionally, we can surmise that the reason VOPD does not work in wave64 is that RDNA executes wave64 as back-to-back wave32 on the same SIMD32, not across two SIMD32s, which is basically what the patch is doing.

Continuing on...
I find it very intriguing that AMD is going back to 4x SIMD per CU, back to GCN, albeit this time it is 4x SIMD32 per CU on RDNA3 vs 4x SIMD16 per CU on GCN.
On GCN, 16x SIMD16 (4 CUs) share the same frontend, and now it comes full circle: 8x SIMD32 (2 CUs / 1 WGP) share the same frontend.

VGPR per SIMD32 seems to be the same, as per LLVM, though.
I assume they started profiling and found that allocating 1x frontend per 4 SIMD32 in RDNA was excessive.
Another possibility is that the vector L0s of the CUs in a WGP contain so much duplicated data that it makes more sense to have all SIMDs in a WGP share the same L0.
Recall that the L0 instruction cache is already shared across the WGP, but vector L0 is not.

In short, I believe in RDNA3 they merged 2 CUs into one and combined their L0s and LDSs. In WGP mode the LDS is mergeable and shareable across all CUs anyway (like RDNA1/2), and potentially L0 is also shared across the WGP. With this they can reduce data duplication across CUs, increase cache size and hit rate, cut frontend duplication in half, and address the programming quirks mentioned in the optimization guide for RDNA (see the YT video by Lou Kramer and also the RDNA whitepaper).

At CU level,
  • 4x SIMD32 per CU, each SIMD32 has 2x FP32 VALU, 4x the FP32 throughput of RDNA2 CU theoretically, should be consistent in rendering loads
  • L0 in each CU is doubled (shared by all SIMD32s like in RDNA1/2)
  • LDS is doubled (shared by all SIMDs like in RDNA1/2)
At WGP level
  • 8x SIMD32 in one WGP
  • L0 is now 4x in size and accessible by the entire WGP (in RDNA1/2 it is not, but combining them would remove data duplication improving hit rate due to more space)
  • LDS is now 4x (accessible by the entire WGP like in RDNA1/2)
So each RDNA3 CU is quite a fat CU. GL1 has fewer clients; L0 has more clients but is fatter, and therefore gets a higher hit rate.

Memory model for RDNA3 is still not fully updated in LLVM (as per dev comments), it is the one to watch out for.

Going by how AMD will deliver performance in Zen4, I think they will take the same approach with RDNA3: narrower, but much faster clocks.
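To put rough numbers on the CU-level summary above, here is a little arithmetic sketch. The SIMD/VALU counts are the speculation from this post, not confirmed specs:

```python
# Rough peak FP32 throughput per CU, RDNA2 vs. the rumored RDNA3 layout
# described above. All unit counts are assumptions from this thread.

LANES_PER_SIMD32 = 32
FMA_FLOPS_PER_LANE = 2  # one fused multiply-add = 2 FLOPs

def cu_fp32_flops_per_clock(simds_per_cu: int, fp32_valus_per_simd: int) -> int:
    """Peak FP32 FLOPs per clock for one CU."""
    return simds_per_cu * fp32_valus_per_simd * LANES_PER_SIMD32 * FMA_FLOPS_PER_LANE

rdna2 = cu_fp32_flops_per_clock(simds_per_cu=2, fp32_valus_per_simd=1)
rdna3 = cu_fp32_flops_per_clock(simds_per_cu=4, fp32_valus_per_simd=2)

print(rdna2, rdna3, rdna3 // rdna2)  # 128 512 4
```

Which matches the "4x the FP32 throughput of an RDNA2 CU" bullet, assuming dual issue can actually be fed every cycle.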
 

Kepler_L2

Senior member
Sep 6, 2020
466
1,910
106
I found a more precise answer to your question. It seems that RDNA has a separate 2 lane DPFP VALU per SIMD ... Additionally, we can surmise that VOPD does not work in wave64 is because RDNA executes wave64 in back to back wave32 on same SIMD32 not across two SIMD32s which is basically what the patch is doing. ...
The Dual Vector ALU patent suggests that wave64 is split in two SIMD32s to execute in a single cycle.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,243
136
The Dual Vector ALU patent suggests that wave64 is split in two SIMD32s to execute in a single cycle.
I think you meant wave64 is split into two 32-lane VALUs on the same SIMD, as shown (not two SIMD32s, per said patent).

Currently wave64 is executed back-to-back on the same SIMD (from the manual); of course this may change.
To handle wave64 instructions, the wave controller issues and executes two wave32 instructions, each operating on half of the work-items of the wave64 instruction. The default way to handle a wave64 instruction is simply to issue and execute the upper and lower halves of each instruction back-to-back – conceptually slicing every instruction horizontally.
For now we have seen the VOPD patch only for wave32, but wave64 should be doable for some ops (e.g. those that do not require 3 operands, unlike VOP3 ops).
wave64 is not supported at the moment for VOPD, as can be seen from the listing below.
v_dual_mul_f32 v0, v0, v2 :: v_dual_mul_f32 v1, v1, v3
// GFX11: encoding: [0x00,0x05,0xc6,0xc8,0x01,0x07,0x00,0x00]
// W64-ERR: :[[@LINE-2]]:{{[0-9]+}}: error

v_dual_fmaak_f32 v122, s74, v161, 2.741 :: v_dual_mov_b32 v247, 2
// GFX11: encoding: [0x4a,0x42,0x51,0xc8,0x82,0x00,0xf6,0x7a,0x8b,0x6c,0x2f,0x40]
// W64-ERR: :[[@LINE-2]]:{{[0-9]+}}: error
With wave32 they can mix and match ops where all operands can be fed by the 4 VGPR banks, immediates, and constant operands from SGPRs (see quote from the manual):
While the vector ALUs primarily read data from the high-bandwidth vGPRs, the scalar register file now supplies up to two broadcast operands per clock to every lane.
The LLVM patches mentioned the VGPR count is still 256, which means they are still limited to 4 banks only, which means 4 vector operands from the VGPR per cycle (not enough for complete VOP3 dual issue).
But wave64 is not so critical for games anyway.

I suspect 6 VGPR banks will come with RDNA4 and beyond, which could produce 6 vector operands per cycle to fully support real dual issue for all VOP3 ops. Or could each VGPR bank produce 2 operands per cycle in RDNA3? Probably not, as can be seen from the LLVM patches.
Another possibility is a 64-wide VGPR bank, like the DC GPUs have, but that's unlikely. It would be very fat.
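The manual's back-to-back wave64 execution quoted earlier can be sketched in a few lines (a toy illustration, not how the hardware pipelines it):

```python
# Minimal sketch of the quoted manual behavior: a wave64 instruction is
# executed as two back-to-back wave32 passes on the same SIMD32, each
# pass covering half of the 64 work-items.

def exec_wave32(op, values):
    """One wave32 pass: apply the op across 32 lanes."""
    assert len(values) == 32
    return [op(v) for v in values]

def exec_wave64(op, values):
    """Slice the wave64 horizontally into upper and lower wave32 halves."""
    assert len(values) == 64
    lo = exec_wave32(op, values[:32])   # first wave32 pass
    hi = exec_wave32(op, values[32:])   # second, back-to-back pass
    return lo + hi

result = exec_wave64(lambda x: x * 2, list(range(64)))
print(result[:4], result[-1])  # [0, 2, 4, 6] 126
```

Since both halves run on the same SIMD32, there is no second VALU issue slot left over, which fits the observation that VOPD is wave32-only for now.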
 
Last edited:
Reactions: lobz

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
Now it would be interesting to know exactly how N31-N33 are configured. We saw some leaked numbers for the total unit counts of RDNA3's SKUs, but this double-lane FP32 seems to throw all rumors off. That is, most rumors assumed the same number of FP32 units per CU as RDNA 1/2 when calculating FP32 throughput. But if this is really doubled, well... N31 would be a 24576 FP32 unit GPU, with clocks supposedly higher than RDNA2. That would be a 122TF@2500MHz card. Granted, as these units would also do INT and other operations, their efficiency would not improve, in that regard, with respect to RDNA2. It would also explain how an N33 is supposed to be as fast as or faster than an N21 despite having fewer CUs and less bandwidth, while not having supposedly way higher clock speeds (sorta similar to the 2080Ti to 3080 jump).
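For what it's worth, the back-of-envelope number works out; here is the arithmetic, using the rumored unit count and clock from this post (not confirmed specs) and counting an FMA as 2 FLOPs:

```python
# Sanity check of the 122 TF figure for N31 above. Unit count and clock
# are rumors from this thread, not official specs.
fp32_units = 24576
clock_hz = 2.5e9        # 2500 MHz
flops_per_unit = 2      # FMA = multiply + add per cycle

tflops = fp32_units * flops_per_unit * clock_hz / 1e12
print(tflops)  # 122.88
```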
 

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,243
136
Now it would be interesting to know exactly how N31-N33 are configured. We saw some leaked numbers for the total unit counts of RDNA3's SKUs, but this double-lane FP32 seems to throw all rumors off. That is, most rumors assumed the same number of FP32 units per CU as RDNA 1/2 when calculating FP32 throughput. But if this is really doubled, well... N31 would be a 24576 FP32 unit GPU, with clocks supposedly higher than RDNA2. That would be a 122TF@2500MHz card. Granted, as these units would also do INT and other operations, their efficiency would not improve, in that regard, with respect to RDNA2. It would also explain how an N33 is supposed to be as fast as or faster than an N21 despite having fewer CUs and less bandwidth, while not having supposedly way higher clock speeds (sorta similar to the 2080Ti to 3080 jump).
At the moment the VGPR cannot produce enough operands per cycle for all-out dual FMA in all cases.
But you can see the compiler devs trying to weave some magic to optimize the wave execution, such that they reorder ops to make sure v_dual_xxx gets a higher chance of scheduling; let's see what software optimization can do.
But I think it should give at least 30%-35% more throughput than a single 32-lane VALU to be worth the effort.
And I think we are far from the LLVM patches being done; the memory model is not even there yet. So they are keeping the juicy bits as late as possible.
You can see they don't have test cases where v_dual_xxx can be tested, except image and memory instructions. They need real wavefront tests, not only instruction tests.
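The operand-supply limit can be illustrated with a toy model. The mod-4 bank mapping and the `can_dual_issue` helper are my own assumptions for illustration, not the actual hardware rule; the point is just the pigeonhole argument: two full 3-source ops need 6 VGPR reads but 4 banks supply only 4 per cycle:

```python
# Toy model of the 4-bank VGPR read limit discussed above: assume the
# 256 VGPRs map to banks by index mod 4, and each bank can feed one
# vector operand per cycle. A dual-issued pair whose VGPR sources
# collide on a bank (or simply need >4 reads) cannot co-issue.
# Illustration only; real operand-gather rules are more involved.

NUM_BANKS = 4

def bank(vgpr_index: int) -> int:
    return vgpr_index % NUM_BANKS

def can_dual_issue(op_a_srcs, op_b_srcs) -> bool:
    """True if every VGPR source of both ops hits a distinct bank."""
    banks = [bank(r) for r in list(op_a_srcs) + list(op_b_srcs)]
    return len(banks) == len(set(banks))

# Two 3-source (VOP3-style) ops: 6 reads > 4 banks -> never possible.
print(can_dual_issue((0, 1, 2), (3, 4, 5)))  # False
# Two 2-source ops spread across distinct banks can pair up.
print(can_dual_issue((0, 1), (2, 3)))        # True
```

A hypothetical 6-bank register file would make the first case feasible, which is the RDNA4 speculation above.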
 
Last edited:

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
At the moment the VGPR cannot produce enough operands per cycle for all-out dual FMA in all cases.
But you can see the compiler devs trying to weave some magic to optimize the wave execution, such that they reorder ops to make sure v_dual_xxx gets a higher chance of scheduling; let's see what software optimization can do.
But I think it should give at least 30%-35% more throughput than a single 32-lane VALU to be worth the effort.
And I think we are far from the LLVM patches being done; the memory model is not even there yet. So they are keeping the juicy bits as late as possible.
You can see they don't have test cases where v_dual_xxx can be tested, except image and memory instructions. They need real wavefront tests, not only instruction tests.

Yes, that is why I said that efficiency in terms of real throughput / number of FP32 units will likely decrease. But total throughput per CU would be significantly higher.
 