Question Speculation: RDNA2 + CDNA Architectures thread

Page 117

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
All die sizes are within 5mm^2. The poster here has been right on some things in the past afaik, and to his credit was the first to say 505mm^2 for Navi21, which other people have backed up. Even so, take the following with a pinch of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
The RX 5700 XT has an average clockspeed of 1887 MHz, so 25% higher clocks for Big Navi would actually be 2359 MHz, but the info says only about 2.2 GHz, which is only 16.6% higher.
Another thing is that doubling the number of CUs won't increase performance by 100%, but by ~90-95%, and you also need double the ROPs.
It should be at least on the level of the RTX 3080, but reviews will tell us the truth.

The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800-1850 MHz depending on the game. So if N21's official game clock is 2.2 GHz, we should not be surprised to see it run between 2250-2300 MHz. So game clock to game clock it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x CU, 25% higher clocks, 10-15% higher perf/clock. Even without higher perf/clock, N21 is going to end up > 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
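The combined uplift being estimated here is simple multiplication. A quick sketch (every figure below is the thread's speculation, not an official spec):

```python
# Rough Navi21 vs RX 5700 XT speedup estimate from the figures in this thread.
# All numbers are speculation quoted above, not official specifications.

cu_factor = 2.0              # 80 CU vs 40 CU
clock_factor = 2250 / 1800   # ~2.25 GHz vs ~1.8 GHz observed game clocks
ipc_factor = 1.10            # assumed 10% perf/clock gain for RDNA2

speedup = cu_factor * clock_factor * ipc_factor
print(f"ideal speedup: {speedup:.2f}x")  # 2 x 1.25 x 1.10 = 2.75x

# Doubling CUs rarely scales perfectly; at 90-95% scaling the estimate drops:
for cu_scaling in (0.90, 0.95):
    est = cu_factor * cu_scaling * clock_factor * ipc_factor
    print(f"with {cu_scaling:.0%} CU scaling: {est:.2f}x")
```

Even the pessimistic end of that range lands comfortably above 2x the RX 5700 XT, which is the crux of the argument.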
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,440
2,926
136
The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800-1850 MHz depending on the game. So if N21's official game clock is 2.2 GHz, we should not be surprised to see it run between 2250-2300 MHz. So game clock to game clock it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x CU, 25% higher clocks, 10-15% higher perf/clock. Even without higher perf/clock, N21 is going to end up > 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
Didn't I also say I expect at least RTX 3080 performance? Because 2x the 5700 XT is RTX 3080 performance at 4K.
 

maddie

Diamond Member
Jul 18, 2010
4,789
4,773
136
If you see that RDNA2 with a 256-bit bus is faster than the 3070 with the same bus, then it can; we just don't know in which games it saves enough, and whether the limit is the memory bus or the GPU.

For sure miners are DEAD and will not buy AMD anymore, as mining is strictly tied to raw memory bandwidth and bus width.

Also, don't forget that a wider bus makes the PCB design and the GPU-side memory controller a lot more expensive.
Yep, it appears that anything needing mainly unique data for each operation will be at a disadvantage. This explains the split to the CDNA architecture for HPC, distributed computing, etc. This is a good thing for gamers.
 

Guru

Senior member
May 5, 2017
830
361
106
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance through that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
 

Glo.

Diamond Member
Apr 25, 2015
5,768
4,693
136
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance through that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
In order to go chiplet for dGPUs, you need to achieve the best possible, almost perfect scaling with CU counts. The first step to achieve this is redesigning the caches, to achieve the highest possible internal bandwidth.

If RDNA2 brings those cache improvements, designed for perfect or almost perfect scaling, we should see better scaling with CU counts than what we have seen with RDNA1.

And we have seen that that architecture achieved not 70% scaling going from the RX 5500 XT's CU count to the RX 5700 XT's CU count, but 86%.
 

Glo.

Diamond Member
Apr 25, 2015
5,768
4,693
136
Scaling with CU count going from 22 CU to 40 CU was around 93% with RDNA1, taking into account a game clock to game clock comparison. If we take the game clock on the 5500 XT vs the max clock on the 5700 XT, we are just shy of 90%. But going from 40 to 80 CU is certainly a bigger step.
For chiplets, it doesn't matter how many CUs you have. You have to have the best possible, linear scaling with CU counts.
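The scaling percentages in these posts can be reproduced if "scaling efficiency" is defined as the performance ratio divided by the CU ratio. The 1.69x performance gap used below is an illustrative assumption, not measured data:

```python
# CU scaling efficiency: how much of the extra CU count shows up as performance.

def scaling_efficiency(cu_small, cu_big, perf_ratio):
    """perf_ratio = larger chip's performance / smaller chip's performance,
    ideally measured at matched clocks so only the CU count differs."""
    return perf_ratio / (cu_big / cu_small)

# Example: RX 5500 XT (22 CU) -> RX 5700 XT (40 CU), assuming a 1.69x
# clock-normalised performance gap (placeholder figure, not benchmark data).
eff = scaling_efficiency(22, 40, 1.69)
print(f"CU scaling efficiency: {eff:.0%}")  # ~93%
```

Plugging in clock-normalised benchmark results instead of the placeholder ratio is what distinguishes the ~86% and ~93% figures quoted in the thread.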
 

maddie

Diamond Member
Jul 18, 2010
4,789
4,773
136
For (GPU) chiplets... latencies are most important. Just saying
Please explain, as everything I've read by GPU designers states that increased latency is much easier to manage in gaming GPUs than in CPUs.

By the way, interposer-based designs have interconnect latencies of 1-2 ns (per a several-years-old Xilinx slide).
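For context, at GPU clock speeds those interposer latencies amount to only a few cycles (the 2.0 GHz clock below is an assumed illustrative value, not a spec):

```python
# Convert an interconnect latency in nanoseconds to GPU clock cycles.
def ns_to_cycles(latency_ns, clock_ghz):
    # cycles = time * frequency; 1 ns at 1 GHz is exactly 1 cycle.
    return latency_ns * clock_ghz

for ns in (1.0, 2.0):
    print(f"{ns} ns at 2.0 GHz = {ns_to_cycles(ns, 2.0):.0f} cycles")
```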
 

omikun

Junior Member
Aug 22, 2007
7
3
81
Please explain, as everything I've read by GPU designers states that increased latency is much easier to manage in gaming GPUs than in CPUs.

GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer warps are needed to hide said latency, which means a smaller number of registers is needed to sustain performance.
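That warps-versus-registers trade-off can be sketched with Little's-law-style arithmetic. All latencies, widths, and register counts below are illustrative, not RDNA specifications:

```python
# To keep a SIMD busy while one warp waits out an instruction's latency,
# the scheduler needs enough other warps in flight, and each resident warp
# pins its registers. Lower latency -> fewer warps -> smaller register need.

def warps_to_hide(latency_cycles, issue_interval_cycles):
    # Ceiling division: warps needed so a new instruction can issue every
    # `issue_interval_cycles` while each warp waits `latency_cycles`.
    return -(-latency_cycles // issue_interval_cycles)

def registers_needed(warps, threads_per_warp, regs_per_thread):
    return warps * threads_per_warp * regs_per_thread

# Halving the effective latency halves the warps (and registers) required:
for latency in (40, 20):
    w = warps_to_hide(latency, issue_interval_cycles=4)
    r = registers_needed(w, threads_per_warp=32, regs_per_thread=32)
    print(f"latency {latency} cycles -> {w} warps, {r} registers")
```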
 

beginner99

Diamond Member
Jun 2, 2009
5,224
1,598
136
Please explain, as everything I've read by GPU designers states that increased latency is much easier to manage in gaming GPUs than in CPUs.

Yeah, I also wonder about the scaling. In fact, scaling would be far more important in a monolithic GPU, as it makes a larger and larger die less and less useful performance-wise and far more expensive to manufacture (yields!). Adding another "cheap" chiplet at sub-optimal scaling has a far lower cost than making a bigger die at sub-optimal scaling.

The real issue is latency, and possibly bandwidth, between chiplets, and how to connect them together: e.g. some kind of IO die, or does each chiplet go to memory directly? (Unlikely, as each one would need its own memory controller.)

In fact, this last part could be the driving factor behind the cache redesign. Too much traffic to the IO die and too high a memory latency (chiplet -> IO -> memory). So needing fewer accesses to memory would be a huge bonus.
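The "needing fewer accesses to memory" point can be put in numbers: every hit in a large on-die cache is traffic that never has to cross the chiplet -> IO -> memory path. The bandwidth demand and hit rates below are made-up illustrations:

```python
# Off-chip traffic left over after an on-die cache absorbs some fraction of
# the shader array's bandwidth demand.

def offchip_traffic(demand_gbs, hit_rate):
    """Traffic (GB/s) that must still cross the external memory bus."""
    return demand_gbs * (1.0 - hit_rate)

demand = 1000.0  # GB/s of raw bandwidth demand (illustrative figure)
for hit_rate in (0.0, 0.5, 0.75):
    print(f"hit rate {hit_rate:.0%}: "
          f"{offchip_traffic(demand, hit_rate):.0f} GB/s off-chip")
```

A 75% hit rate cuts the external traffic to a quarter, which is exactly why a cache redesign could make a narrower bus (or a chiplet interconnect) viable.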
 

Kuiva maa

Member
May 1, 2014
181
232
116
Well, with Zen 2 you got Epyc 2 chips containing 8 chiplets and thus up to 64 cores. Imagine that kind of scalability for GPU CUs without having to use a single ~500mm² monolith.

I always thought that GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could result in very big GPUs without a single big die being used, yes. I suppose if it were that easy, they would have brought such a solution to market already. But maybe it is not.
 

moinmoin

Diamond Member
Jun 1, 2017
4,996
7,780
136
I always thought that GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could result in very big GPUs without a single big die being used, yes. I suppose if it were that easy, they would have brought such a solution to market already. But maybe it is not.
Bandwidth is definitely the much bigger issue, seeing how performance scalability comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck. Only a solution with some form of overprovisioning seems feasible there if the design is to be realistically fully scalable. And at this point latency can become crucial as well; after all, a GPU has only the time of one frame to do its work. The higher the framerate, the less time there is; and the more chiplets, and thus the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and the communication between chiplets can be avoided, so I expect CDNA to make use of a chiplet approach much earlier, and much more extensively, than RDNA ever will.
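The frame-budget argument in numbers: the per-frame time budget shrinks as the framerate rises, so a fixed inter-chiplet latency cost eats a growing fraction of it. The 0.1 ms crosstalk figure is a made-up illustration:

```python
# Per-frame time budget vs a fixed inter-chiplet synchronisation cost.

def frame_budget_ms(fps):
    return 1000.0 / fps

crosstalk_ms = 0.1  # assumed fixed inter-chiplet latency cost per frame
for fps in (60, 144, 240):
    budget = frame_budget_ms(fps)
    print(f"{fps} fps: {budget:5.2f} ms budget, "
          f"crosstalk eats {crosstalk_ms / budget:.1%} of it")
```

A cost that is negligible at 60 fps becomes several times more significant at 240 fps, which is the bottleneck being described.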
 

Veradun

Senior member
Jul 29, 2016
564
780
136
Bandwidth is definitely the much bigger issue, seeing how performance scalability comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck. Only a solution with some form of overprovisioning seems feasible there if the design is to be realistically fully scalable. And at this point latency can become crucial as well; after all, a GPU has only the time of one frame to do its work. The higher the framerate, the less time there is; and the more chiplets, and thus the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and the communication between chiplets can be avoided, so I expect CDNA to make use of a chiplet approach much earlier, and much more extensively, than RDNA ever will.
I can't see a GPU chiplet solution without an active interposer tbh
 

kurosaki

Senior member
Feb 7, 2019
258
250
86
I always thought that GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could result in very big GPUs without a single big die being used, yes. I suppose if it were that easy, they would have brought such a solution to market already. But maybe it is not.
The 7970xt was a dual-circuit card anyway.
 

maddie

Diamond Member
Jul 18, 2010
4,789
4,773
136
GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer warps are needed to hide said latency, which means a smaller number of registers is needed to sustain performance.
GCN has a 64-thread wave issued 16 threads at a time to a 64-shader CU. RDNA has a 32-thread wave issued all at once to a 32-shader array, all doubled up to allow backwards compatibility with GCN.

AFAIK, this was done to improve occupancy and reduce idle shader resources, not because data latency was a problem.
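The issue-cadence difference described above, written out (the wave sizes and SIMD widths are the publicly documented GCN and RDNA figures):

```python
# Cycles needed to issue one wave across a SIMD of a given lane width.
# GCN executes a 64-thread wave on a 16-lane SIMD over 4 cycles; RDNA
# executes a 32-thread wave on a 32-lane SIMD in a single cycle.

def cycles_per_wave(wave_size, simd_lanes):
    assert wave_size % simd_lanes == 0
    return wave_size // simd_lanes

print("GCN:  wave64 on a 16-lane SIMD ->", cycles_per_wave(64, 16), "cycles")  # 4
print("RDNA: wave32 on a 32-lane SIMD ->", cycles_per_wave(32, 32), "cycle")   # 1
```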
 