I found a more precise answer to your question. It seems that RDNA has a separate 2-lane DPFP VALU per SIMD alongside the 32-lane FP32 VALU (hence SIMD32), so the FP64:FP32 ratio is 2:32, i.e. 1:16.
This 2-lane DPFP VALU can run in parallel with the 32-lane FP32 VALU.
On RDNA3 there is a second 32-lane FP32 VALU per SIMD, which is why DPFP throughput is now 1:32 (i.e. 2 : 2x32).
View attachment 63983
So indeed, AMD is going to project 2x FP32 throughput per SIMD32 in RDNA3.
So applications which rely heavily on vector f32 mul/add, or the mul_mul/mul_add/add_add/add_mul + accumulate patterns, are going to get a great boost, e.g. rendering apps.
Games would benefit opportunistically, when kernel instructions can be reordered so that operand dependencies allow 2x FP32 vector ops per cycle per SIMD32, including FMA-type ops. The VGPRs have some new swizzling modes to support gathering the operands needed for the VOPD dual-issue ops.
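As a purely illustrative sketch (my own example in CUDA/HIP-style C++, not from the patches; whether the RDNA3 backend actually emits VOPD pairs depends on the compiler, wave32 mode and register allocation), this is the kind of instruction shape that gives dual issue something to pair:

```cpp
// Hypothetical illustration only.
__global__ void fma_chains(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = a[i], y = b[i];

    // Two independent accumulator chains: acc0 and acc1 have no data
    // dependency on each other, so in principle one FMA from each chain can
    // be paired per cycle per SIMD32 (a VOPD-friendly shape in wave32).
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int k = 0; k < 64; k += 2) {
        acc0 = fmaf(x, y + (float)k,       acc0);  // chain 0
        acc1 = fmaf(x, y + (float)(k + 1), acc1);  // chain 1, independent
    }

    // By contrast, a single dependent chain serializes: every FMA needs the
    // previous result, leaving nothing independent to co-issue.
    // float acc = 0.0f;
    // for (int k = 0; k < 64; ++k) acc = fmaf(x, y + (float)k, acc);

    out[i] = acc0 + acc1;
}
```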
Additionally, we can surmise that the reason VOPD does not work in wave64 is that RDNA executes a wave64 as back-to-back wave32 passes on the same SIMD32, not across two SIMD32s, which is basically what the patch is doing.
Continuing on...
I find it very intriguing that AMD appears to be going back to 4x SIMDs per CU, as in GCN, albeit this time it is
4x SIMD32 per CU on RDNA3 vs 4x SIMD16 per CU on GCN.
On GCN, 16x SIMD16 (4 CUs) shared the same frontend, and now it comes full circle: 8x SIMD32 (2 CUs / 1 WGP) share the same frontend.
VGPR capacity per SIMD32 seems to be unchanged though, going by LLVM.
I assume they did some profiling and found that allocating one frontend per 4x SIMD32 in RDNA was excessive.
Another possibility is that the vector L0 of each CU in a WGP contains so much duplicated data that it makes more sense to have all SIMDs in a WGP share the same L0.
We can recall that the L0 instruction cache is already shared across the WGP, but the vector L0 is not.
In short, I believe that in RDNA3 they merged 2 CUs into one and combined their L0s and LDSs. In WGP mode the LDS is mergeable and shareable across all CUs anyway (as in RDNA1/2), and potentially the L0 is also shared across the WGP. With this they can reduce data duplication across CUs, increase cache size and hit rate, cut the frontend duplication in half, and address the programming quirks mentioned in the optimization guide for RDNA (see the YT video by Lou Kramer and also the RDNA whitepaper).
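To make the LDS point a bit more concrete, here is a toy sketch (again my own example; the tile size is a placeholder, not an RDNA3 figure). Shared-memory tiles are what physically occupy the LDS, so a larger or WGP-shared LDS means bigger tiles per workgroup and/or more workgroups resident at once:

```cpp
// Toy sketch only: TILE is a placeholder, not an RDNA3 spec. __shared__
// arrays are carved out of the LDS on AMD hardware, so more (or shared)
// LDS helps data reuse and occupancy.
// Assumes a (TILE, TILE) block and width/height that are multiples of TILE.
constexpr int TILE = 32;

__global__ void tile_row_sum(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE];   // 32*32*4 bytes of LDS per workgroup

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // stage into LDS
    __syncthreads();

    // Each thread sums its row of the staged tile (toy reuse of LDS data).
    float s = 0.0f;
    for (int k = 0; k < TILE; ++k)
        s += tile[threadIdx.y][k];
    out[y * width + x] = s;
}
```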
At CU level,
- 4x SIMD32 per CU, each SIMD32 with 2x FP32 VALUs: theoretically 4x the FP32 throughput of an RDNA2 CU (rough numbers in the sketch after the WGP list below), and should be consistent in rendering loads
- L0 in each CU is doubled (shared by all SIMD32s like in RDNA1/2)
- LDS is doubled (shared by all SIMDs like in RDNA1/2)
At WGP level
- 8x SIMD32 in one WGP
- L0 is now 4x in size and accessible by the entire WGP (in RDNA1/2 it is not, but combining them removes data duplication and improves hit rate thanks to the extra capacity)
- LDS is now 4x (accessible by the entire WGP like in RDNA1/2)
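A quick back-of-the-envelope check of the CU numbers above (host-side arithmetic only; the SIMD and VALU counts are the speculated ones from this post, and an FMA is counted as 2 FLOPs):

```cpp
// Back-of-the-envelope FP32 rates, plain host-side C++ (no GPU needed).
#include <cstdio>

int main()
{
    const int lanes         = 32;  // lanes per SIMD32
    const int flops_per_fma = 2;   // one FMA = mul + add

    const int rdna2_cu  = 2 * 1 * lanes * flops_per_fma;  // 2x SIMD32, 1 VALU each = 128 FLOPs/clk
    const int rdna3_cu  = 4 * 2 * lanes * flops_per_fma;  // 4x SIMD32, 2 VALUs each = 512 FLOPs/clk
    const int rdna3_wgp = 8 * 2 * lanes * flops_per_fma;  // 8x SIMD32, 2 VALUs each = 1024 FLOPs/clk

    printf("RDNA2 CU : %4d FP32 FLOPs/clock\n", rdna2_cu);
    printf("RDNA3 CU : %4d FP32 FLOPs/clock (%.0fx an RDNA2 CU)\n",
           rdna3_cu, (double)rdna3_cu / rdna2_cu);
    printf("RDNA3 WGP: %4d FP32 FLOPs/clock\n", rdna3_wgp);
    return 0;
}
```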
So each RDNA3 CU is quite a fat CU. GL1 therefore has fewer clients; L0 has more clients but is fatter, and therefore gets a better hit rate.
The memory model for RDNA3 is still not fully updated in LLVM (per dev comments); that is the one to watch out for.
Going by how AMD looks set to deliver performance in Zen 4, I think they will take the same approach with RDNA3: narrower, but with much faster clocks.