I found a more precise answer to your question. It seems that RDNA has a separate 2-lane DPFP VALU per SIMD alongside the 32-lane FP32 VALU (hence SIMD32), so the FP64:FP32 ratio is 2:32, i.e. 1:16.
This 2-lane DPFP VALU can run in parallel with the 32-lane FP32 VALU.
On RDNA3 there is a second 32-lane FP32 VALU per SIMD, which is why DPFP throughput is now 1:32 (i.e. 2 : 2x32).
View attachment 63983
So indeed, AMD is going to project 2x FP32 throughput per SIMD32 in RDNA3.
So applications which rely heavily on vector f32 mul/add, or the mul_mul/mul_add/add_add/add_mul + accumulate patterns, are going to get a great boost, e.g. rendering apps.
Games would benefit opportunistically, when kernel instructions can be reordered so that operand dependencies allow 2x FP32 vector ops per cycle per SIMD32, including FMA-type ops. The VGPRs have some new swizzling modes to support gathering the operands needed for the VOPD dual-issue ops.
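As a purely illustrative sketch (my own example in CUDA/HIP-style C++, not from the patches; whether the RDNA3 backend actually emits VOPD pairs depends on the compiler, wave32 mode and register allocation), this is the kind of instruction shape that gives dual issue something to pair:

```cpp
// Hypothetical illustration only.
__global__ void fma_chains(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = a[i], y = b[i];

    // Two independent accumulator chains: acc0 and acc1 have no data
    // dependency on each other, so in principle one FMA from each chain can
    // be paired per cycle per SIMD32 (a VOPD-friendly shape in wave32).
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int k = 0; k < 64; k += 2) {
        acc0 = fmaf(x, y + (float)k,       acc0);  // chain 0
        acc1 = fmaf(x, y + (float)(k + 1), acc1);  // chain 1, independent
    }

    // By contrast, a single dependent chain serializes: every FMA needs the
    // previous result, leaving nothing independent to co-issue.
    // float acc = 0.0f;
    // for (int k = 0; k < 64; ++k) acc = fmaf(x, y + (float)k, acc);

    out[i] = acc0 + acc1;
}
```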
Additionally, we can surmise that the reason VOPD does not work in wave64 is that RDNA executes a wave64 as back-to-back wave32 passes on the same SIMD32, not across two SIMD32s, which is basically what the patch is doing.
Continuing on...
I find it very intriguing that AMD appears to be going back to 4x SIMDs per CU, as in GCN, albeit this time it is
4x SIMD32 per CU on RDNA3 vs 4x SIMD16 per CU on GCN.
On GCN, 16x SIMD16 (4 CUs) shared the same frontend, and now it comes full circle: 8x SIMD32 (2 CUs / 1 WGP) share the same frontend.
VGPR capacity per SIMD32 seems to be unchanged though, going by LLVM.
I assume they did some profiling and found that allocating one frontend per 4x SIMD32 in RDNA was excessive.
Another possibility is that the vector L0 of each CU in a WGP contains so much duplicated data that it makes more sense to have all SIMDs in a WGP share the same L0.
We can recall that the L0 instruction cache is already shared across the WGP, but the vector L0 is not.
In short, I believe that in RDNA3 they merged 2 CUs into one and combined their L0s and LDSs. In WGP mode the LDS is mergeable and shareable across all CUs anyway (as in RDNA1/2), and potentially the L0 is also shared across the WGP. With this they can reduce data duplication across CUs, increase cache size and hit rate, cut the frontend duplication in half, and address the programming quirks mentioned in the optimization guide for RDNA (see the YT video by Lou Kramer and also the RDNA whitepaper).
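To make the LDS point a bit more concrete, here is a toy sketch (again my own example; the tile size is a placeholder, not an RDNA3 figure). Shared-memory tiles are what physically occupy the LDS, so a larger or WGP-shared LDS means bigger tiles per workgroup and/or more workgroups resident at once:

```cpp
// Toy sketch only: TILE is a placeholder, not an RDNA3 spec. __shared__
// arrays are carved out of the LDS on AMD hardware, so more (or shared)
// LDS helps data reuse and occupancy.
// Assumes a (TILE, TILE) block and width/height that are multiples of TILE.
constexpr int TILE = 32;

__global__ void tile_row_sum(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE];   // 32*32*4 bytes of LDS per workgroup

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // stage into LDS
    __syncthreads();

    // Each thread sums its row of the staged tile (toy reuse of LDS data).
    float s = 0.0f;
    for (int k = 0; k < TILE; ++k)
        s += tile[threadIdx.y][k];
    out[y * width + x] = s;
}
```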
At CU level,
- 4x SIMD32 per CU, each SIMD32 with 2x FP32 VALUs: theoretically 4x the FP32 throughput of an RDNA2 CU (rough numbers in the sketch after the WGP list below), and should be consistent in rendering loads
- L0 in each CU is doubled (shared by all SIMD32s like in RDNA1/2)
- LDS is doubled (shared by all SIMDs like in RDNA1/2)
At WGP level
- 8x SIMD32 in one WGP
- L0 is now 4x in size and accessible by the entire WGP (in RDNA1/2 it is not, but combining them removes data duplication and improves hit rate thanks to the extra capacity)
- LDS is now 4x (accessible by the entire WGP like in RDNA1/2)
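A quick back-of-the-envelope check of the CU numbers above (host-side arithmetic only; the SIMD and VALU counts are the speculated ones from this post, and an FMA is counted as 2 FLOPs):

```cpp
// Back-of-the-envelope FP32 rates, plain host-side C++ (no GPU needed).
#include <cstdio>

int main()
{
    const int lanes         = 32;  // lanes per SIMD32
    const int flops_per_fma = 2;   // one FMA = mul + add

    const int rdna2_cu  = 2 * 1 * lanes * flops_per_fma;  // 2x SIMD32, 1 VALU each = 128 FLOPs/clk
    const int rdna3_cu  = 4 * 2 * lanes * flops_per_fma;  // 4x SIMD32, 2 VALUs each = 512 FLOPs/clk
    const int rdna3_wgp = 8 * 2 * lanes * flops_per_fma;  // 8x SIMD32, 2 VALUs each = 1024 FLOPs/clk

    printf("RDNA2 CU : %4d FP32 FLOPs/clock\n", rdna2_cu);
    printf("RDNA3 CU : %4d FP32 FLOPs/clock (%.0fx an RDNA2 CU)\n",
           rdna3_cu, (double)rdna3_cu / rdna2_cu);
    printf("RDNA3 WGP: %4d FP32 FLOPs/clock\n", rdna3_wgp);
    return 0;
}
```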
So each RDNA3 CU is quite a fat CU. GL1 therefore has fewer clients; L0 has more clients but is fatter, and therefore gets a better hit rate.
The memory model for RDNA3 is still not fully updated in LLVM (per dev comments); that is the one to watch out for.
Going by how AMD looks set to deliver performance in Zen 4, I think they will take the same approach with RDNA3: narrower, but with much faster clocks.