Question Speculation: RDNA3 + CDNA2 Architectures Thread

uzzi38 · Jan 23, 2021

Man I have been dying to make this one for a while now.

First rumours for RDNA3 are here so new thread time!

Just going to start off with this one for now: kopite7kimi on Twitter: "@VideoCardz Ah, I mean a simple mcm design with 10240 cores is not enough. Because the lift from RDNA2 to RDNA3 is much bigger than from RDNA1 to RDNA2. We should expect many big improvements in GFX11. 🤔" / Twitter

Karnak · Jul 5, 2022

Considering the 384-bit bus for N31 and if we're going with 20 Gbps GDDR6 that's a total of 960 GB/s bandwidth without any kind of IF$.

And with that in mind you really don't need more than 192 MByte of IF$. That's still a +50% increase over N21 on top of the already way higher bandwidth. Just quoting GN here: anything else would be "a waste of sand". I'm 100% certain that 192 MByte will be the maximum we'll see.

Saylick · Jul 5, 2022

Karnak said:
Considering the 384-bit bus for N31 and if we're going with 20 Gbps GDDR6 that's a total of 960 GB/s bandwidth without any kind of IF$.

And with that in mind you really don't need more than 192 MByte of IF$. That's still a +50% increase over N21 on top of the already way higher bandwidth. Just quoting GN here: anything else would be "a waste of sand". I'm 100% certain that 192 MByte will be the maximum we'll see.

What are you basing this assertion that 192 MB is enough on? Is it from the hit rate chart that AMD provided for N21? If so, the curve for 4K resolution still didn't look like it was plateauing just yet. Furthermore, I assume the hit rate is a function of the number of local caches it needs to service. With more CUs, the Infinity Cache needs to scale up proportionally at the least in order to maintain the same hit rate.

Kepler_L2 · Jul 5, 2022

Saylick said:
What are you basing this assertion that 192 MB is enough on? Is it from the hit rate chart that AMD provided for N21? If so, the curve for 4K resolution still didn't look like it was plateauing just yet. Furthermore, I assume the hit rate is a function of the number of local caches it needs to service. With more CUs, the Infinity Cache needs to scale up proportionally at the least in order to maintain the same hit rate.

N21 is 128MB for 80 CUs, N31 is 192MB for 96 CUs.
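For anyone who wants the ratio spelled out, here is a quick sketch in Python using only the figures in this post. The N31 numbers are rumoured, not confirmed, and the `configs` and `cache_per_cu` names are made up for illustration.

```python
# Infinity Cache per compute unit, from the figures quoted in this post.
configs = {
    "N21": {"cache_mb": 128, "cus": 80},
    "N31": {"cache_mb": 192, "cus": 96},  # rumoured values, not confirmed
}


def cache_per_cu(cache_mb, cus):
    """Return megabytes of Infinity Cache available per compute unit."""
    return cache_mb / cus


ratios = {name: cache_per_cu(**cfg) for name, cfg in configs.items()}

for name, mb in ratios.items():
    print(f"{name}: {mb:.2f} MB of cache per CU")

# Relative change in cache per CU going from N21 to N31.
per_cu_gain_pct = (ratios["N31"] / ratios["N21"] - 1) * 100
print(f"Cache per CU change: {per_cu_gain_pct:+.0f}%")
```

On these numbers, cache per CU goes from 1.6 MB to 2.0 MB, so it scales faster than the CU count rather than merely keeping pace with it.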

maddie · Jul 5, 2022

Kepler_L2 said:
N21 is 128MB for 80 CUs, N31 is 192MB for 96 CUs.

And if you take the claim of a 50% performance increase at face value, you will need to churn through 50% more data/sec. Back to square 1.

Karnak · Jul 6, 2022

Saylick said:
What are you basing this assertion that 192 MB is enough on?

Again: There's a 384-bit bus and assuming 20 Gbps GDDR6 (which is most likely IMO by the end of 2022, if not even 24 Gbps) that already would be a total of 960 GB/s.

512 GB/s (N21) vs. 960 GB/s (N31) = around +88%
128 MByte IF$ (N21) vs. 192 MByte IF$ (N31) = +50%

Literally I see no reason - since you need to consider both the raw bandwidth and not just the IF$ - why you need more than 192 MByte for the latter if you're having almost 1 TB/s of raw bandwidth already. Waste of sand and I'm sure that's how AMD feels and why we won't see more than that.
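The bandwidth figures above follow from bus width times per-pin data rate divided by 8. A minimal sketch, assuming N21's known 256-bit bus with 16 Gbps GDDR6 and the rumoured 384-bit bus for N31 (the function name is just illustrative):

```python
def gddr_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s.

    bus_width_bits * data_rate_gbps gives Gbit/s across the whole bus;
    dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


n21_bw = gddr_bandwidth_gbs(256, 16)     # N21: 256-bit, 16 Gbps GDDR6
n31_bw_20 = gddr_bandwidth_gbs(384, 20)  # N31 rumour: 384-bit, 20 Gbps
n31_bw_24 = gddr_bandwidth_gbs(384, 24)  # same bus, if 24 Gbps were used

bw_gain_pct = (n31_bw_20 / n21_bw - 1) * 100     # raw bandwidth gain
cache_gain_pct = (192 / 128 - 1) * 100           # IF$ capacity gain

print(f"N21: {n21_bw:.0f} GB/s")
print(f"N31 @ 20 Gbps: {n31_bw_20:.0f} GB/s ({bw_gain_pct:+.1f}%)")
print(f"N31 @ 24 Gbps: {n31_bw_24:.0f} GB/s")
print(f"IF$ capacity: {cache_gain_pct:+.0f}%")
```

This reproduces 512 GB/s and 960 GB/s, with the raw bandwidth gain coming out at +87.5%. The 24 Gbps case would land at 1152 GB/s.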

leoneazzurro · Jul 6, 2022

IC provides for bandwidth amplification. We are probably looking at an effective bandwidth @4K of around 3-3.5 TB/s even with only 192 MB of cache. It then depends on how much BW the chip needs with the increased FP capability (including all the other improvements). A 32 MB cache die + 64-bit VRAM channel is certainly small, but not as small as the V-Cache on Ryzen, because it will also integrate the RAM controller (which scales quite badly compared to the rest). So maybe 40-50 mm^2 on N6 if 32 MB and 70 mm^2 if 64 MB? A 64-bit channel is needed if it has to be reused on N32...
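As a rough sanity check on that 3-3.5 TB/s figure, here is a toy model of bandwidth amplification. It assumes (my simplification, not AMD's published method) that a fraction `hit_rate` of traffic is served from cache, DRAM only sees the misses, and the cache itself is never the bottleneck. The 960 GB/s DRAM figure is the one assumed earlier in the thread.

```python
def effective_bandwidth_gbs(dram_gbs, hit_rate):
    """Total traffic the memory system can sustain when DRAM sees only misses.

    Toy model: DRAM serves the (1 - hit_rate) share of requests, so the
    sustainable total is dram_gbs / (1 - hit_rate).
    """
    if not 0 <= hit_rate < 1:
        raise ValueError("hit_rate must be in [0, 1)")
    return dram_gbs / (1 - hit_rate)


def implied_hit_rate(dram_gbs, effective_gbs):
    """Invert the model: what hit rate gives this effective bandwidth?"""
    return 1 - dram_gbs / effective_gbs


dram = 960  # GB/s, 384-bit @ 20 Gbps as assumed in the thread

low = implied_hit_rate(dram, 3000)   # for 3.0 TB/s effective
high = implied_hit_rate(dram, 3500)  # for 3.5 TB/s effective

print(f"Implied hit rate for 3.0 TB/s: {low:.1%}")
print(f"Implied hit rate for 3.5 TB/s: {high:.1%}")
```

Under this model, 3-3.5 TB/s effective corresponds to a 4K hit rate of roughly 68-73%. Real hardware is messier (cache bandwidth limits, write traffic, access patterns), so treat it as an order-of-magnitude check only.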

DisEnchantment · Jul 6, 2022

Karnak said:
Literally I see no reason - since you need to consider both the raw bandwidth and not just the IF$ - why you need more than 192 MByte for the latter if you're having almost 1 TB/s of raw bandwidth already. Waste of sand and I'm sure that's how AMD feels and why we won't see more than that.

No reason doesn't seem like an engineering principle to me.

leoneazzurro said:
IC provides for bandwidth amplification.

DRAM BW is only one design consideration. The moment you allow client requests to go straight to RAM, it gets oversaturated. DRAM has recovery times, row and column strobes take time, and there are burst lengths to respect.
The purpose of cache design is not only to provide the data on a hit, but also to cluster loads and stores so that the DRAM does not slow to a crawl.
On a CPU with 8 cores this won't be visible. On GPUs it is a totally different game.
Imagine a thread running on a CU requests memory that is not in cache: it stalls, the wave controller switches to another thread, and that one requests memory as well. Now multiply that by how many threads a wave controller can support, and then by how many CUs there are.

tomatosummit · Jul 6, 2022

The rdna2 IC graph certainly looked like 4k resolutions needed 192MB for n21.
But n31 could need more for any number of reasons.
Why stop at 4k, ultra widescreen and super sampling are an option for halo graphics cards.
Ray tracing was suggested to benefit from cache.
The memory controller chiplets are probably limited by the PHY sizes, so why not put as much cache as you can on there? It might not all get used in every situation, but it's hardly going to hurt performance.

I hope the Phoenix rumour about using the cache chiplet is true though. I can't believe the performance claims with only DDR5, but I wonder if Phoenix will use the option of a couple of GDDR6 modules coming off the chiplet for its top-end options.

leoneazzurro · Jul 6, 2022

Enhanced Wave64 mode/dual mode Wave32 confirmed?

https://twitter.com/x/status/1544565725225943041

Stuka87 · Jul 6, 2022

leoneazzurro said:
Enhanced Wave64 mode/dual mode Wave32 confirmed?

https://twitter.com/x/status/1544565725225943041

I saw the blurry text and immediately thought of this

And well, certainly not confirmed, as only AMD can do that. But I think it goes along with what has already been discussed here.

DisEnchantment · Jul 6, 2022

leoneazzurro said:
Enhanced Wave64 mode/dual mode Wave32 confirmed?

It is the same stuff discussed 2 pages ago, dual wave32 or one cycle wave64. (caveats being that 4 operand from VGPR banks with scalar or immediate)
At least dual 32 lane VALU is confirmed pretty much though.

For more background, that value in the twitter post is a bit mask related to the perf counter logged by the instruction sequencer (or frontend) in a WGP.

Actually there is more crazy stuff in there, but I'm not sure if those entries are just placeholders for future SoCs.
e.g.
32 threads in the Texture Addressing Unit which perform RT with traversal capability, texture (i.e. BVH) load capability.
Dual Geometry Engine

leoneazzurro · Jul 6, 2022

DisEnchantment said:
It is the same stuff discussed 2 pages ago, dual wave32 or one cycle wave64. (caveats being that 4 operand from VGPR banks with scalar or immediate)
At least dual 32 lane VALU is confirmed pretty much though.

For more background, that value in the twitter post is a bit mask related to the perf counter logged by the instruction sequencer (or frontend) in a WGP.

Actually there is more crazy stuff in there, but I'm not sure if those entries are just placeholders for future SoCs.
e.g.
32 threads in the Texture Addressing Unit which perform RT with traversal capability, texture (i.e. BVH) load capability.
Dual Geometry Engine

Yes, I know it was referring to what was already discussed, but as these drivers seem to be confirming it more and more, I posted this tweet. Also, there is a reference to the patch.

Tuna-Fish · Jul 6, 2022

Saylick said:
Furthermore, I assume the hit rate is a function of the number of local caches it needs to service.

It's not. The hit rate plateaus when it successfully starts caching all cacheable data between frames. This means that if you get up near that regime, how fast or wide your GPU is has no direct impact on how much cache you need. Only your workload matters. Cache demand will go up in the future if/when resolutions go up or when devs use more data per pixel.

(Of course, faster GPUs tend to be used for larger workloads.)

moinmoin · Jul 6, 2022

leoneazzurro said:
Yes, I know it was referring to what was already discussed, but as these drivers seem to be confirming it more and more, I posted this tweet. Also, there is a reference to the patch.

It's referring to 4 months old changes though.

https://twitter.com/x/status/1544611149118939136

leoneazzurro · Jul 6, 2022

Yes, but does this change anything? It's clear that Kepler also has sources other than a driver patch.

Timorous · Jul 8, 2022

Soo what does the stack look like based on latest rumours?

N31 = 7900 series with 384bit bus and 24GB ram
N32 = 7800 series with 256bit bus and 16GB ram
N33 = 7600 series with 128bit bus and 8GB ram.
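The capacities in that list fall straight out of the bus widths. A small sketch, assuming one 32-bit GDDR6 chip per 32 bits of bus and 2 GB (16 Gbit) chips with no clamshell mode; both constants are my assumptions, not something confirmed for these cards:

```python
CHIP_BUS_BITS = 32      # assumption: each GDDR6 chip occupies 32 bits of bus
CHIP_CAPACITY_GB = 2    # assumption: 16 Gbit (2 GB) chips, no clamshell


def vram_gb(bus_width_bits):
    """VRAM capacity implied by a bus width under the assumptions above."""
    chips, remainder = divmod(bus_width_bits, CHIP_BUS_BITS)
    if remainder:
        raise ValueError("bus width must be a multiple of 32 bits")
    return chips * CHIP_CAPACITY_GB


stack = {"N31": 384, "N32": 256, "N33": 128}  # rumoured bus widths
capacities = {die: vram_gb(bits) for die, bits in stack.items()}

for die, gb in capacities.items():
    print(f"{die}: {stack[die]}-bit -> {gb} GB")

# Cut-down widths discussed in this thread.
for bits in (320, 192, 96):
    print(f"{bits}-bit -> {vram_gb(bits)} GB")
```

That gives 24, 16 and 8 GB for the three dies, and 20, 12 and 6 GB for 320-bit, 192-bit and 96-bit cut-downs.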

Is there an N34 for 7700 Series or will AMD go the 5600XT route and make 7700 from cut down N32's with a 192 bit bus? I guess it makes some level of sense to use cut N32 because I expect N32 will be the most produced N5 GPU chip so creating a SKU that uses parts that are partially defective or fail 7800 series binning is probably a smart play to maximise yields.

Also is single GCD absolutely confirmed for N31 and N32 or is dual GCD still possible? If dual GCD is still on the table could AMD use a single N31 GCD design to power the 7700 series? 7900 being 2x N31, 7800 being 2x N32 and 7700 being 1x N31. 1x N32 would match N33 in spec but could AMD use that chiplet in laptops or maybe in a 7600 series refresh sku as more 5nm capacity comes online?

Kepler_L2 · Jul 8, 2022

Yes the 7700XT will be a cutdown N32. One interesting thing about the Navi3x MCM design is that when they cutdown VRAM/cache they will actually ship less silicon! This makes configurations like 320-bit 7900XT and 192-bit 7700XT much more palatable for AMD.
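To illustrate the "ship less silicon" point, here is a sketch that counts memory/cache dies (MCDs) per configuration. It assumes each MCD carries a 64-bit memory channel and 32 MB of cache, which is the speculation from earlier in this thread (192 MB across a 384-bit bus), not a confirmed spec.

```python
MCD_BUS_BITS = 64   # assumption from the thread: 64-bit channel per MCD
MCD_CACHE_MB = 32   # assumption from the thread: 32 MB of cache per MCD


def mcd_config(bus_width_bits):
    """Return MCD count and total cache for a given bus width."""
    count, remainder = divmod(bus_width_bits, MCD_BUS_BITS)
    if remainder:
        raise ValueError("bus width must be a multiple of 64 bits")
    return {"mcds": count, "cache_mb": count * MCD_CACHE_MB}


# Full and cut-down configurations mentioned in this thread.
for label, bits in [("N31 full", 384), ("N31 cut (320-bit)", 320),
                    ("N32 full", 256), ("N32 cut (192-bit)", 192)]:
    cfg = mcd_config(bits)
    print(f"{label}: {cfg['mcds']} MCDs, {cfg['cache_mb']} MB cache")
```

Under these assumptions a 320-bit part uses 5 MCDs instead of 6 and a 192-bit part uses 3 instead of 4, so each cut-down SKU drops one whole die rather than fusing off part of a monolithic chip.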

jpiniero · Jul 8, 2022

Given what they are likely charging for it, putting the N33 in the 7600 range would be a bad idea.

It'd just be simple enough to call N31 7900/XT, N32 7800/XT, and N33 7700/XT.

Kepler_L2 · Jul 8, 2022

jpiniero said:
Given what they are likely charging for it, putting the N33 in the 7600 range would be a bad idea.

It'd just be simple enough to call N31 7900/XT, N32 7800/XT, and N33 7700/XT.

N33 in the 7700XT would mean a 20% reduction in CU count, 33% reduction in VRAM and 50% reduction in PCIe lanes vs the 6700 XT. Yes it's quite a bit faster, but I feel the community would destroy AMD over this kind of downgrade.
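Those percentages can be checked against the 6700 XT's specs (40 CUs, 12 GB, PCIe 4.0 x16). The N33 side uses the rumoured 32 CUs, 8 GB and x8 link, so treat it as a sketch on unconfirmed numbers:

```python
def reduction_pct(old, new):
    """Percentage reduction going from old to new."""
    return (old - new) / old * 100


rx_6700_xt = {"cus": 40, "vram_gb": 12, "pcie_lanes": 16}
n33_rumoured = {"cus": 32, "vram_gb": 8, "pcie_lanes": 8}  # not confirmed

reductions = {
    key: reduction_pct(rx_6700_xt[key], n33_rumoured[key])
    for key in rx_6700_xt
}

for key, pct in reductions.items():
    print(f"{key}: -{pct:.0f}%")
```

This gives 20% fewer CUs, about 33% less VRAM and 50% fewer PCIe lanes.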

jpiniero · Jul 8, 2022

Kepler_L2 said:
N33 in the 7700XT would mean a 20% reduction in CU count, 33% reduction in VRAM and 50% reduction in PCIe lanes vs the 6700 XT. Yes it's quite a bit faster, but I feel the community would destroy AMD over this kind of downgrade.

I thought N33 was 40 CUs or the equivalent of that.

Might be time for a new numbering scheme.

Kepler_L2 · Jul 8, 2022

jpiniero said:
I thought N33 was 40 CUs or the equivalent of that.

Might be time for a new numbering scheme.

The dies are 96, 64 and 32 CUs, with 128 SPs per CU now.

leoneazzurro · Jul 8, 2022

jpiniero said:
I thought N33 was 40 CUs or the equivalent of that.

Might be time for a new numbering scheme.

N33 seems to be 4096 SP, organized in 32 CU, with each CU having double the FP resources of RDNA2's CU and probably (way) higher clock.
Rumors seem to say that its performance is on par with a 6900XT (at Full HD). I don't know how reliable that is, but if it is true, an N31 (3x N33) could quite possibly be 2.5x+ a 6900XT in raster.
I'd add that, due to the extreme modularity of the memory configuration, it's very possible we will get a lot of different SKUs this time.
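For reference, the shader counts implied by the rumoured die sizes, as a quick sketch. The 128 SPs per CU and the 96/64/32 CU counts are the rumoured figures from this thread; the 6900 XT line uses its known 80 CUs at 64 SPs each.

```python
SP_PER_CU_RDNA3 = 128  # rumoured
SP_PER_CU_RDNA2 = 64   # known for RDNA2

rumoured_cus = {"N31": 96, "N32": 64, "N33": 32}  # rumoured CU counts

shader_counts = {die: cus * SP_PER_CU_RDNA3 for die, cus in rumoured_cus.items()}
sp_6900xt = 80 * SP_PER_CU_RDNA2  # 6900 XT (N21)

for die, sps in shader_counts.items():
    print(f"{die}: {rumoured_cus[die]} CUs -> {sps} SPs")
print(f"6900 XT: 80 CUs -> {sp_6900xt} SPs")

# How many N33-sized shader arrays fit in N31.
n31_vs_n33 = shader_counts["N31"] / shader_counts["N33"]
print(f"N31 / N33 shader ratio: {n31_vs_n33:.0f}x")
```

That works out to 12288, 8192 and 4096 SPs, against 5120 on the 6900 XT, with N31 at exactly 3x N33 in shader count.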

Glo. · Jul 8, 2022

AMD can still cut down heavily the N33 die, into 3 SKUs.

If Kepler is correct, and N33 is going to be 7600 Series, then we are looking at 7600 XT, 7600, and potentially - 7500 XT.

However, we know that it was required for AMD to disable pairs of CUs in order to get usable dies.

Is it possible that AMD is required to disable pairs of WGPs now?

7600 XT - 32 CUs,
7600 - 24 CUs,
7500 XT - 16 CUs, or...
7600 XT - 32 CUs,
7600 - 28 CUs,
7500 XT - 24 CUs?

For the 7500 XT, AMD can give it a 96-bit bus (6GB VRAM), and sell it for over $200 ($250-270).

jpiniero · Jul 8, 2022

Glo. said:
AMD can still cut down heavily the N33 die, into 3 SKUs.

Yields are so good at TSMC that it shouldn't be necessary.

TESKATLIPOKA · Jul 8, 2022

Kepler_L2 said:
N33 in the 7700XT would mean a 20% reduction in CU count, 33% reduction in VRAM and 50% reduction in PCIe lanes vs the 6700 XT. Yes it's quite a bit faster, but I feel the community would destroy AMD over this kind of downgrade.

32CU instead of 40CU is not a downgrade when the performance increases by a lot. The real downgrade is 8GB VRAM. If it had 12GB, then I don't think 8 PCIe4 lanes would really matter.

Kepler_L2 said:
The dies are 96, 64 and 32 CUs, with 128 SPs per CU now.

Do we get 64 SPs (shaders) per SIMD32, or are there 2x more SIMD32s per CU?
