The MS team clearly thought 256-bit GDDR6 bandwidth wasn't enough to feed 12 TFLOPS of RDNA2 + Zen 2 (which shouldn't take much in itself), hence the awkward dual-bandwidth memory solution. Curious to see how AMD plans to feed 20+ TFLOPS RDNA2 SKUs on a 256-bit GDDR6 bus if that's true.
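For rough context, here's the back-of-the-envelope bytes-per-FLOP math behind that worry (a minimal sketch; the 14 Gbps pin speed is an assumed, typical GDDR6 figure, and the 20 TFLOPS part is hypothetical):

```python
# Bytes-per-FLOP on a 256-bit GDDR6 bus (illustrative numbers, not confirmed specs).
def gddr6_bandwidth_gb_s(bus_width_bits: int, pin_speed_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * pin_speed_gbps

bw = gddr6_bandwidth_gb_s(256, 14.0)  # 448 GB/s
for tf in (12.0, 20.0):
    print(f"{tf:>4.0f} TFLOPS: {bw:.0f} GB/s -> {bw / (tf * 1000):.3f} bytes/FLOP")
# 12 TFLOPS gets ~0.037 bytes/FLOP; 20 TFLOPS only ~0.022, i.e. roughly 40% less
# data per FLOP from the same bus, which is the feeding problem described above.
```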
I'd guess that's because of the new overall data system tied to the NAND implementation. They probably need some dedicated channels just to manage the data to/from it.
And the point would be? I don't see a good reason to use up more die space for 128 MB of L2 or L3 instead of adding another 128-bit GDDR6 memory controller + PHY, when the effect would be the same at best.
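Rough sketch of why it wouldn't necessarily be "the same at best": a cache trades pad-limited PHY area for hit rate, and the blended bandwidth scales with that hit rate. A first-order model (every number here is a guess for illustration, including the cache bandwidth):

```python
# First-order blended bandwidth: hits served from cache, misses from DRAM.
# All figures are assumptions for illustration, not leaked or confirmed specs.
def effective_bandwidth(hit_rate: float, cache_bw: float, dram_bw: float) -> float:
    return hit_rate * cache_bw + (1.0 - hit_rate) * dram_bw

dram_256bit = 448.0   # GB/s, 256-bit GDDR6 @ 14 Gbps
dram_384bit = 672.0   # GB/s, the wider-bus alternative (extra controller + PHY)
cache_bw = 1000.0     # GB/s, a guessed on-die cache bandwidth

for hit_rate in (0.3, 0.5, 0.7):
    blended = effective_bandwidth(hit_rate, cache_bw, dram_256bit)
    print(f"hit rate {hit_rate:.0%}: cache ~{blended:.0f} GB/s vs. 384-bit {dram_384bit:.0f} GB/s")
# ~30% hits: 614 GB/s (loses); ~50%: 724 GB/s (wins); ~70%: 834 GB/s.
# So whether the cache beats the wider bus hinges entirely on hit rate.
```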
And your evidence showing that it's the same? I'd guess the weird setup in the consoles is due to the NAND, where they probably need some channels linked to it (serving as a buffer). Plus, the consoles have specialized compression/decompression blocks, so it's possible Microsoft wants all the bandwidth they can get, since they're working with a total memory pool comparable to just a dGPU's VRAM. A dGPU can use that extra memory to simply keep data resident in VRAM, whereas the new consoles will apparently be juggling it to and from the SSD so you can switch between games quickly.
I'm catching up now, so has it been confirmed that's what the cache is? If not, it seems odd to be making arguments without knowing. If it is, I think you also have to consider that RDNA2 isn't meant to be the final implementation (people talk about it as though it's revolutionary, and while it's obviously very significant, keep in mind it's an iteration on their GPU development path). So it's possible this cache implementation is just the start and we'll see it change in the future, perhaps later becoming a separate chip altogether, or being integrated into the I/O die or some other chiplet. I think that was the idea/plan for HBM (where it would function both as cache and memory), but for whatever reason it didn't work out that way.
As for why the consoles wouldn't have it, cost would be a big reason. It's why console versions of chips tend to have smaller caches and the like: it's an easy place to cut cost, and the highly leveraged programming of consoles, mixed with other limitations, mitigates the loss somewhat.
One last bit. I personally have a hunch that Microsoft actually had a stronger chip planned (I think they were looking at 15 TF, possibly with room to push higher), but reined it in when they saw the PS5 was going to be substantially below them. Perhaps they kept the CU count low for yields (maybe they were looking at 60-64 CUs or something), or maybe they plan on iterating more quickly (since the new consoles apparently still won't be quite enough for flawless 4K, and then add ray tracing on top), in which case perhaps they can add CUs without messing with memory configs or anything.
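For reference on how those CU/TF figures pencil out, RDNA-style FP32 throughput is just CUs × 64 lanes × 2 FLOPs (FMA) × clock; the 60/64 CU clocks below are assumptions to show how a ~15 TF part could have looked:

```python
# FP32 TFLOPS for an RDNA2-style GPU: 64 lanes per CU, 2 FLOPs per clock (FMA).
def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000.0

print(f"52 CUs @ 1.825 GHz: {tflops(52, 1.825):.2f} TF")  # ~12.15 TF (Series X as shipped)
print(f"60 CUs @ 1.950 GHz: {tflops(60, 1.950):.2f} TF")  # ~14.98 TF (hypothetical clock)
print(f"64 CUs @ 1.825 GHz: {tflops(64, 1.825):.2f} TF")  # ~14.95 TF (hypothetical)
```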