Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

Glo. · Feb 5, 2024

igor_kavinski said:
Why couldn't it be latency?

Could also be driver overhead. Remember, the driver has to share system RAM with the CPU's other running processes. By the time it receives its quantum slice, the data it has may already be stale, needing to pull more which incurs another latency penalty waiting on the higher latency of LPDDR5 chips. The iGPU really needs its own slice of dedicated cache (256MB if possible).

Neither latency, nor driver overhead explains the small drop in performance for 760M relative to the drop in CU count.

igor_kavinski · Feb 5, 2024

See this:

https://www.reddit.com/r/ROGAlly/comments/1860wo6

Glo. · Feb 5, 2024

igor_kavinski said:
See this:
https://www.reddit.com/r/ROGAlly/comments/1860wo6

Confirming what I wrote.

Neither latency nor driver overhead explains a 15% drop when 33% of the computational power is cut.

The L2 cache is bigger than needed for low-latency operation with 768 ALUs.
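A rough sketch of the arithmetic behind the 15% vs. 33% argument. The CU counts (12 for the 780M, 8 for the 760M) and the 15% figure are taken from the thread; the helper name `per_unit_ratio` is made up for illustration. It only shows how much more each remaining CU appears to deliver, not what the actual bottleneck is.

```python
# Scaling check for the 760M vs. 780M comparison discussed above.
# Inputs come from the posts; the function name is hypothetical.

def per_unit_ratio(units_fraction: float, perf_fraction: float) -> float:
    """Performance per remaining unit, relative to the full configuration."""
    return perf_fraction / units_fraction

cu_fraction = 8 / 12       # 760M keeps 8 of the 780M's 12 CUs (33% fewer)
perf_fraction = 1 - 0.15   # 760M reported as about 15% slower

ratio = per_unit_ratio(cu_fraction, perf_fraction)
print(f"Per-CU performance ratio: {ratio:.3f}")  # about 1.275
```

If performance scaled linearly with CU count the ratio would be 1.0. A value near 1.275 means each CU in the smaller part does roughly 27% more work, which is what you would expect if the larger part is held back by something other than CU count.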

StefanR5R · Feb 5, 2024

igor_kavinski said:
See this:
https://www.reddit.com/r/ROGAlly/comments/1860wo6

This guy writes "Well it’s not really a straight forward answer because there’s an abnormally large L2 adding a ton of effective bandwidth and the “standard” bandwidth number of just giving max bandwidth of the memory chips over a 128 bit bus doesn’t really give much info."
And where is the analysis of workload footprints to back this claim?
"Still just the memory bandwidth is roughly 95-100GB/S which is similar to a 1050"
Except that the GTX 1050 doesn't share this with a CPU which itself needs to access the memory all over the place in games and various other (e.g. interactive) workloads.
Edit, PS:
GTX 1050 --> 1.9 TFLOPS peak*
Radeon 780M --> 8.3 TFLOPS peak*

EDIT half a day later, *) that's of course theoretical peak FP32 throughput [ = if the units could be fed continuously, and maybe the operations mix plays a role too]
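For reference, the "95-100GB/s" figure can be sanity-checked with the usual peak-bandwidth formula (bus width in bytes times transfer rate). This is a minimal sketch: the memory speeds below are assumptions, not numbers from the thread, and the result is a theoretical peak that the iGPU still has to share with the CPU.

```python
# Theoretical peak memory bandwidth from bus width and transfer rate.
# Memory speeds are assumed for illustration.

def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth in GB/s: bytes per transfer times mega-transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000

configs = [
    ("128-bit LPDDR5-6400 (assumed APU config)", 128, 6400),
    ("128-bit LPDDR5X-7500 (assumed APU config)", 128, 7500),
    ("128-bit GDDR5 at 7000 MT/s (GTX 1050, assumed)", 128, 7000),
]

for label, width, rate in configs:
    print(f"{label}: {peak_bandwidth_gbs(width, rate):.1f} GB/s")
```

Under these assumptions the APU lands at 102.4 to 120.0 GB/s and the GTX 1050 at 112.0 GB/s, so the raw numbers are in the same range; the difference is that the dGPU has that bandwidth to itself.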

soresu · Feb 5, 2024

StefanR5R said:
GTX 1050 --> 1.9 TFLOPS peak
Radeon 780M --> 8.3 TFLOPS peak

AMD doubled their FLOPS numbers for RDNA3 to denote the 'dual issue' CU change, but the reality of the µArch's throughput for most gaming workloads isn't anywhere near that.

I guess we'll see soon enough if RDNA3.5 or RDNA4 make any significant changes in that area.
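To make the dual-issue point concrete, here is a small sketch of how the quoted peak figures fall out of shader count and clock. The shader counts and clocks (768 at 2.7 GHz for the 780M, 640 at 1.455 GHz for the GTX 1050) are assumptions chosen to match the numbers quoted above, and `peak_fp32_tflops` is a hypothetical helper.

```python
# Theoretical peak FP32 throughput, with and without counting dual issue.
# Shader counts and clocks are assumptions, not figures from the thread.

def peak_fp32_tflops(shaders: int, clock_ghz: float, dual_issue: bool = False) -> float:
    """Peak TFLOPS: an FMA counts as 2 FLOPs; dual issue doubles that per clock."""
    flops_per_clock = 2 * (2 if dual_issue else 1)
    return shaders * flops_per_clock * clock_ghz / 1000

print(f"780M, dual issue counted: {peak_fp32_tflops(768, 2.7, dual_issue=True):.2f} TFLOPS")
print(f"780M, single issue only:  {peak_fp32_tflops(768, 2.7):.2f} TFLOPS")
print(f"GTX 1050:                 {peak_fp32_tflops(640, 1.455):.2f} TFLOPS")
```

With dual issue counted the 780M comes out at about 8.29 TFLOPS; without it, about 4.15 TFLOPS, which is the more relevant comparison if most game shaders cannot use the second issue slot.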

eek2121 · Feb 5, 2024

Philste said:
Because AMD's APUs have been bandwidth starved for 2 years now?! It started with Rembrandt, and now Phoenix is worse. The Phoenix iGPU is clocked over 15% higher than Rembrandt's and uses RDNA3, yet it's not even 10% faster with the same RAM and barely 15% faster with faster RAM. Also, OEMs will always cheap out on faster RAM, so don't expect every device to come with LPDDR5X-8533. New tests of the desktop versions also show that 7200 makes nearly no difference compared to 5200 (only about 7%). All this leads to the conclusion that Strix will be bandwidth starved to the moon.

Is it? Originally I had no horse in this race because I only have one use case for an APU. However, are there tools out there to measure GPU memory utilization? You actually have me tempted to run some tests.

Glo. said:
Also, the 760M is just 15% slower than the 780M while having 33% fewer CUs.

If this is not memory starvation - I do not know what is.

That could be any number of factors. A big one is thermals and power consumption/limits, though a look at clocks should make that pretty obvious. There are other items as well. I don’t think we have enough data to reach the conclusion that these chips are bandwidth starved, especially since APUs today have far more bandwidth available vs. the past.

Even if AMD is hitting bandwidth limits, there are a few ways to get around that limit.

eek2121 · Feb 5, 2024

StefanR5R said:
This guy writes "Well it’s not really a straight forward answer because there’s an abnormally large L2 adding a ton of effective bandwidth and the “standard” bandwidth number of just giving max bandwidth of the memory chips over a 128 bit bus doesn’t really give much info."
And where is the analysis of workload footprints to back this claim?
"Still just the memory bandwidth is roughly 95-100GB/S which is similar to a 1050"
Except that the GTX 1050 doesn't share this with a CPU which itself needs to access the memory all over the place in games and various other (e.g. interactive) workloads.
Edit, PS:
GTX 1050 --> 1.9 TFLOPS peak
Radeon 780M --> 8.3 TFLOPS peak

Possibly, but I disagree with the 95-100GB/s number. DDR5 has been tested on an AMD APU by an overclocker at DDR5-10600, and the number will likely rise from there. The only limit one has to worry about is AMD's silicon. Memory chip makers are actually working on much faster DDR5 and LPDDR5. LPDDR5 12667 and beyond should be out within the next 1-2 years or so. DDR6 will accelerate this.

It is also possible to use GDDR6 for system memory, though neither Intel nor AMD has shown any desire to do this outside of consoles. GDDR6 has higher latency, but consoles use it and games run fine.

Glo. · Feb 5, 2024

eek2121 said:
That could be any number of factors. A big one is thermals and power consumption/limits, though a look at clocks should make that pretty obvious. There are other items as well. I don’t think we have enough data to reach the conclusion that these chips are bandwidth starved, especially since APUs today have far more bandwidth available vs. the past.

Even if AMD is hitting bandwidth limits, there are a few ways to get around that limit.

The biggest factor is sharing memory bandwidth with the CPU.

That's the biggest one.

Currently, it is impossible to fully feed 8 CPU cores and 12 CUs with a 128-bit bus.
The 760M only appears in the 8600G, a CPU with 6 cores. The bandwidth requirements are cut by 25% for the CPU cores and by 33% for the CUs.

Hence the 760M is only 15% slower than the 780M, despite having its CU count cut by 33%.

Now, imagine Strix Point's Zen 5 cores. There are 12 of them. What will the bandwidth requirements be for them, and for the GPU, which will have 33% more CUs than the 780M?

It will be an even bigger challenge.

soresu · Feb 5, 2024

igor_kavinski said:
That's actually how much my boss makes and I starve in comparison, with crappy health insurance and not enough to get my own apartment with a proper lease agreement.

And oh, you should watch him type a formula in Excel. You will want to kill yourself.

I feel you.

That was me for 6 years watching the deputy manager swan about in my last long term job.

Utterly useless waste of space that made about 50% more than my direct boss who worked 60+ hours.

The guy took over 2 months off for a golf injury...... it just makes my blood boil.

soresu · Feb 5, 2024

I wonder if there is any possibility that Strix Halo will have a GDDR6 eDRAM stack on the package.

If they already have a die with some infinity cache on it then it doesn't seem a huge stretch that they might add DRAM into that stack too.

NTMBK · Feb 5, 2024

igor_kavinski said:
Why couldn't it be latency?

Could also be driver overhead. Remember, the driver has to share system RAM with the CPU's other running processes. By the time it receives its quantum slice, the data it has may already be stale, needing to pull more which incurs another latency penalty waiting on the higher latency of LPDDR5 chips. The iGPU really needs its own slice of dedicated cache (256MB if possible).

GPUs are designed to be very latency tolerant. They have large thread counts they swap out to avoid stalling. They intentionally focus on throughput over latency.

DrMrLordX · Feb 5, 2024

FlameTail said:
So I guess Strix Halo could fit into the chassis of a Macbook Pro equivalent Windows Laptop.

Yeah but not anything like Macbook Air. Strix Halo is not going to be a 15-25W chip.

NTMBK · Feb 5, 2024

I really hope we see 256-bit APUs in the next generation. With LPCAMM it's not a ridiculous idea, and we'd finally be able to see proper replacement of most laptop GPUs. The 128-bit bus has held back AMD ever since the original Llano APU, and it's basically dictated by how many SODIMMs you can fit on a laptop board. Time to blow the doors off already!

Glo. · Feb 5, 2024

NTMBK said:
I really hope we see 256-bit APUs in the next generation. With LPCAMM it's not a ridiculous idea, and we'd finally be able to see proper replacement of most laptop GPUs. The 128-bit bus has held back AMD ever since the original Llano APU, and it's basically dictated by how many SODIMMs you can fit on a laptop board. Time to blow the doors off already!

It's way past due that we get 256-bit on mainstream platforms.

FlameTail · Feb 5, 2024

On-package memory is the way to go.

It will enable very wide RAM buses (above 256 bit), without blowing up the motherboard cost.

itsmydamnation · Feb 5, 2024

FlameTail said:
On-package memory is the way to go.

It will enable very wide RAM buses (above 256 bit), without blowing up the motherboard cost.

Yuck. If it were so attractive, consoles would do it.

S'renne · Feb 5, 2024

itsmydamnation said:
Yuck. If it were so attractive, consoles would do it.

I mean Apple and soon Lunar Lake are doing so...

itsmydamnation · Feb 6, 2024

S'renne said:
I mean Apple and soon Lunar Lake are doing so...

Yes, let's make our hot thing hotter by putting hot things right next to it or on top of it.

I would rather have a big SLC with a smaller memory bus. It will win on power as well as perf, if it's big enough to get a good hit rate.

adroc_thurston · Feb 6, 2024

soresu said:
I wonder if there is any possibility that Strix Halo will have a GDDR6 eDRAM stack on the package.

If they already have a die with some infinity cache on it then it doesn't seem a huge stretch that they might add DRAM into that stack too.

no.
more tiers == more tears.
DRAM doesn't work as a cache anyway.

S'renne said:
I mean Apple and soon Lunar Lake are doing so...

That's a total mobo area play.
tablet chips duh.

soresu · Feb 6, 2024

adroc_thurston said:
DRAM doesn't work as a cache anyway.

I didn't say anything about using GDDR as a cache.

I meant solely as VRAM that is just on package, and separate from the main system LPDDR/DDR memory for the CPU cores.

Or I guess they could just build the IO pins into it instead, and leave the GDDR for OEMs to put on the mobo.

DrMrLordX · Feb 6, 2024

itsmydamnation said:
Yes, let's make our hot thing hotter by putting hot things right next to it or on top of it.

Depends, it can make things simpler depending on form factor. As it currently stands, cooling DRAM on a desktop motherboard is simply not easy to do. Yes most DRAM doesn't get that hot, but if it does, you have big goofy heatspreaders and then you're relying on internal case airflow to do the rest, which does not yield great results. The only reason why it's not a huge problem for desktop is that most pedestrian setups simply do not push much power or produce much heat with their system RAM. For OEMs it's a bit of a headache since most OEMs are not big on internal airflow (or fan noise), but having DIMMs or even SIMMs in your setup makes it difficult to integrate RAM cooling with the rest of your cooling setup.

Moving a few extra watts on-package is not going to be a big problem if you know the heat is going to be there and have the ability to cool it. On-package RAM would have relatively large area/power ratio compared to compute chiplets running AVX512 or SVE2 code (for example). If SoCs using on-package RAM aren't forced to deal with klunky/bad integrated heat spreaders, the larger package can get direct contact with a cooling plate, making it just as easy to cool on-package RAM as it is to cool RAM on dGPUs.

CakeMonster · Feb 6, 2024

AMD To Release X870E Chipset and Ryzen 9000 Granite Ridge CPUs (Guru3d)

NTMBK · Feb 6, 2024

On package RAM isn't a totally new idea. Intel already shipped it in the Kabylake+Vega wacky product they made back in 2018:

So yes, putting dedicated VRAM on package is certainly doable. But I still think that configurable, user-replaceable LPCAMM is the way forward: 256 bits of LPDDR5X should be plenty of bandwidth for any APU that fits in a laptop thermal envelope.

S'renne · Feb 6, 2024

NTMBK said:
On package RAM isn't a totally new idea. Intel already shipped it in the Kabylake+Vega wacky product they made back in 2018:

So yes, putting dedicated VRAM on package is certainly doable. But I still think that configurable, user-replaceable LPCAMM is the way forward: 256 bits of LPDDR5X should be plenty of bandwidth for any APU that fits in a laptop thermal envelope.

Can 256 bit LPDDR5X feed the potential 40 CU RDNA 3.5 iGPU though? The 6700XT itself has 384 GB/s bandwidth..
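A back-of-the-envelope comparison for this question, as a sketch only. It assumes LPDDR5X-8533 (the speed mentioned earlier in the thread) on a 256-bit bus, takes the 384 GB/s and 40 CU figures as given, and ignores on-die cache and the CPU's share of the bus. The helper names are hypothetical.

```python
# Peak bandwidth per CU: hypothetical 256-bit LPDDR5X APU vs. the 6700 XT.
# Memory speed is an assumption; cache effects and CPU traffic are ignored.

def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth in GB/s from bus width and transfer rate."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000

def bandwidth_per_cu(total_gbs: float, cu_count: int) -> float:
    """Peak bandwidth available per compute unit, in GB/s."""
    return total_gbs / cu_count

apu_gbs = peak_bandwidth_gbs(256, 8533)   # assumed LPDDR5X-8533
dgpu_gbs = 384.0                          # 6700 XT figure quoted in the post

print(f"256-bit LPDDR5X-8533: {apu_gbs:.1f} GB/s total, "
      f"{bandwidth_per_cu(apu_gbs, 40):.2f} GB/s per CU")
print(f"6700 XT:              {dgpu_gbs:.1f} GB/s total, "
      f"{bandwidth_per_cu(dgpu_gbs, 40):.2f} GB/s per CU")
```

On these assumptions the APU gets about 273 GB/s, roughly 71% of the 6700 XT's peak, before the CPU cores take their share.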

Abwx · Feb 6, 2024

CakeMonster said:
AMD To Release X870E Chipset and Ryzen 9000 Granite Ridge CPUs (Guru3d)

Their source is MLID.
