Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 600
Jul 27, 2020
17,925
11,695
116
Not really, it depends on the application. The data I have been working with has a large chunk for each thread. There will not be any inter-process communication until the work is completed.
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning a thread doesn't need to update the data while working on it?

But yes, if an application's very nature requires inter-thread communication, it will get screwed by high latencies. I don't know what type of multithreading game engines do. I think the type of workload you are doing probably corresponds more to something like CBR23.
 
Jun 1, 2024
44
24
36
Isn't it true that there is a large discrepancy between software development and cpu architectures?

basically non-existent alignment between software and hardware

the only programmers doing low level stuff are for embedded, weak and feeble cpus due to limited resources

every other modern software is written in high level languages, huge abstractions, which translate into a mess in actual IPC

general purpose cpus are literally that, they take a gamble on performance, caches are half blind etc

can you imagine how fast real world perf would be if there was better alignment between software and hardware?
 
Reactions: igor_kavinski

tsamolotoff

Member
May 19, 2019
62
95
91
the only programmers doing low level stuff are for embedded, weak and feeble cpus due to limited resources
No? It's fairly important for HPC as well. Your calculations won't scale unless you write your program to be NUMA-aware and MPI(CH)-aware from the start, with thread overhead kept to a minimum. It's just that game development is currently at the bottom of the barrel: salaries and crunch are not conducive to attracting talent who know what a pointer is and can actually program, instead of doing the proverbial 'match a square hole' thing with shapes made of managed code.
 

JustViewing

Member
Aug 17, 2022
163
276
96
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning a thread doesn't need to update the data while working on it?

But yes, if an application's very nature requires inter-thread communication, it will get screwed by high latencies. I don't know what type of multithreading game engines do. I think the type of workload you are doing probably corresponds more to something like CBR23.
If you are working in low-level languages, you create work groups (a method + a pointer to the data + an event signal) and run them as threads, then wait for all of the work (or any single item) to complete. When a thread finishes a task, it signals completion.
In high-level languages like .NET, it is very easy to do automatic multithreading: you create a list of work items and ask the .NET runtime to run them in parallel. The key here is that there is no communication between the threads or the parent until the work is completed.

In essence, you break the work into multiple blocks and run them in parallel, then combine the results and continue with the process.

Personally, I haven't faced a situation where there needs to be constant communication between threads before the work is completed. Even when communication is required, it is done through signaling. The real issue is that most work cannot be broken up; there is a linear dependency.
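The fork-join pattern described above (partition into independent blocks, run in parallel, combine once all workers signal completion) can be sketched with Python's standard library; the function names and the sum-of-squares workload here are illustrative, not from the post:

```python
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    """Worker: touches only its own slice -- no shared state, so no
    inter-thread communication until the results are combined."""
    return sum(x * x for x in block)

def parallel_sum_squares(data, n_workers=4):
    # Partition the data into contiguous, non-overlapping blocks.
    size = (len(data) + n_workers - 1) // n_workers
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() waits until every worker has finished (the "signal"),
        # then the partial results are combined.
        return sum(pool.map(process_block, blocks))

print(parallel_sum_squares(list(range(1000))))  # → 332833500
```

For CPU-bound pure-Python work a ProcessPoolExecutor would sidestep the GIL, but the structure (fork into independent blocks, join, combine) is the same.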
 

Josh128

Junior Member
Oct 14, 2022
2
8
41
New Strix GB6 runs from 7/1/24. The new 370 run reports a max clock of 5098 MHz, but the .gb6 listing shows nothing higher than ~3.98 GHz. I gave up on trying to pin down IPC in GB; it's going to be somewhere between 580 and 700 points per GHz, lol.

HX 370 (reported)

HX 170 (reported)
 

Joe NYC

Platinum Member
Jun 26, 2021
2,331
2,942
106
This will shed more light on the inter-core communication part: https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/

View attachment 102189
It's not possible to achieve those highlighted requirements without constant inter-thread communication.

Good table there.

In the case of a client CPU with 2 CCDs, how likely is it that demand on the CPU actually needs the 2nd CCD? And when load does spill over one CCD: suppose a process starts on one CCD; then all of its threads (likely to share the same data) could be given affinity to that same CCD. Something like that may be part of the AMD chipset driver and OS scheduler.
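That kind of CCD pinning can already be done by hand on Linux via the scheduler affinity API; a minimal sketch, assuming cores 0-7 sit on one CCD (the numbering is hypothetical and machine-dependent):

```python
import os

def pin_to_ccd(ccd_cpus=range(0, 8)):
    """Pin the current process to one CCD's cores so all of its threads
    share one L3 and avoid the cross-CCD latency penalty.
    Linux-only; which CPU numbers form 'CCD 0' is an assumption here."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform (e.g. macOS, Windows)
    available = os.sched_getaffinity(0)
    target = set(ccd_cpus) & available  # only request CPUs that exist
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

print(pin_to_ccd())
```

Windows exposes the same idea through processor groups and `SetProcessAffinityMask`; AMD's driver plus the OS scheduler would in effect automate this per process.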
 
Jun 1, 2024
44
24
36
If you are working in low level languages

if....

and then the rest you describe about threads/multithreading is still not close enough at all to hardware

TBH multithreading is fairly basic, yet programmers treat it as OMG so advanced so difficult

I'm talking about real optimization of how software works even with L1/L2/L3 caches

for example, does the CPU actually store the right data in the caches to have maximum hit rate?

all this intrinsic CPU functionality is behind a black box, behind the "general purpose" computing package of the modern cpu



P.S. also, what I'm trying to describe is exactly the reason why Nvidia is the "richest" company by market cap in the world right now
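Data layout and traversal order are one part of that "black box" software actually does control. A sketch of row-major versus strided traversal of the same buffer, illustrative only, since Python's interpreter overhead hides most of the cache effect you would see in a compiled language:

```python
from array import array

N = 512  # N x N matrix stored row-major in one flat buffer
mat = array("q", range(N * N))  # 64-bit ints, contiguous in memory

def sum_rows(m, n):
    """Sequential (row-major) walk: consecutive elements share cache
    lines, so the hardware prefetcher keeps the hit rate high."""
    total = 0
    for r in range(n):
        base = r * n
        for c in range(n):
            total += m[base + c]
    return total

def sum_cols(m, n):
    """Strided (column-major) walk over the same data: each access
    jumps n*8 bytes, defeating spatial locality in L1/L2."""
    total = 0
    for c in range(n):
        for r in range(n):
            total += m[r * n + c]
    return total

# Same data, same result; in a compiled language the row-major walk is
# typically several times faster purely from cache-line reuse.
assert sum_rows(mat, N) == sum_cols(mat, N)
```

This is exactly the kind of hardware/software alignment being described: the instructions are identical, only the memory access pattern changes.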
 

Timmah!

Golden Member
Jul 24, 2010
1,463
729
136
I'm not sure if two v-cache CCDs are needed at all.
If we assume they managed to get the V-cache CCD Fmax equal to that of regular CCDs, you'll still end up with inter-CCD penalty. Different CCD Fmax will make things even worse.

P.S. Unless there's a big single L3 chunk glued-TSV'ed atop of both CCDs, with 16 slices forming a common for both CCDs ring (or whatever) bus ? )
I presume if the application can benefit from more than 8 cores, running on 16C despite the inter-CCD penalty is still beneficial, no?
And anyway, even if it's still most beneficial to stick to cores on a single die to avoid that penalty, just the fact that all the cores are equal in this regard and no additional scheduling is required would be worth it imo. Certainly if the price is just another hundred bucks higher.
 

JustViewing

Member
Aug 17, 2022
163
276
96
if....

and then the rest you describe about threads/multithreading is still not close enough at all to hardware

TBH multithreading is fairly basic, yet programmers treat it as OMG so advanced so difficult

I'm talking about real optimization of how software works even with L1/L2/L3 caches

for example, does the CPU actually store the right data in the caches to have maximum hit rate?

all this intrinsic CPU functionality is behind a black box, behind the "general purpose" computing package of the modern cpu



P.S. also, what I'm trying to describe is exactly the reason why Nvidia is the "richest" company by market cap in the world right now
True, you have to treat a single work item as a single-threaded application.
 

tsamolotoff

Member
May 19, 2019
62
95
91
no additional scheduling is required would be worth it imo. Certainly if the price is just another hundred bucks higher.
No, it'd still be required, because the cache is divided into two segments that are treated as a whole by the OS and by applications that try to sync across the whole CPU. For example, even if both your chiplets have extra cache, fps in CS2 will still tank as the game tries to use both CCDs at the same time. Same goes for Warzone and Call of Duty if the worker thread count exceeds the limit of one chiplet. Every other 'scheduling' issue (which is not actually scheduling) can be fixed by setting the 'prefer cache' option in the BIOS or via the registry.
 
Reactions: Timmah!

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,451
136
can you imagine how fast real world perf would be if there was better alignment between software and hardware?

The software that exists would be a much smaller subset of what we have available today. Faster is great, but having "slow" software is better than having no software.

The whole reason we don't aim for alignment with the hardware is that the hardware is constantly evolving and we don't want to lock ourselves in to a single platform or a single vendor on it.

High-level languages were developed so that you could port your programs over to new or different architectures. The software written in x86 assembly and tuned for a specific CPU might be really fast on that one chip, but it might lose performance on the next CPU just because the hardware is built differently in a way that the software clashes with. It's why AMD moved away from VLIW with their GPUs.

We added additional abstraction layers on top of that for similar reasons. As in: I don't care what CPU your Windows PC is running on; I want to develop something that doesn't have to care which version it's running, or even whether it's running Windows at all. I don't want to maintain 30 separate code bases so that my program runs fast on different versions of Windows across multiple architectures from different vendors. No, give me an abstraction layer. The difference is between reasonably fast software and none at all.
 

CakeMonster

Golden Member
Nov 22, 2012
1,428
535
136
New ASUS beta BIOS with AGESA 1.2.0.0 is out:


This is IIRC the third version with 'new CPU' support, so we're getting there...
 

Fjodor2001

Diamond Member
Feb 6, 2010
3,926
404
126
I know some think the Win12 AI PC 40 TOPS requirement is BS on DT.

But just out of curiosity, do you think Zen5 will be able to brute force that requirement using AVX512, or some other generic instruction set? On top-end 9950X only, or even on low-end 9600X?

If not, do you think there'll be some 40+ TOPS NPU that can be added via PCIe extension card to fulfill the requirement (i.e. not using generic discrete GPU)?

Or will AMD give up Win12 AI PC on DT using CPU alone, and hand over that market segment to Intel which is known to fulfill it on Arrow Lake Desktop CPUs?
 

Nothingness

Platinum Member
Jul 3, 2013
2,752
1,402
136
I know some think the Win12 AI PC 40 TOPS requirement is BS on DT.

But just out of curiosity, do you think Zen5 will be able to brute force that requirement using AVX512, or some other generic instruction set? On top-end 9950X only, or even on low-end 9600X?
The math: 2 ops (mul + add) × 64 (int8 lanes per 512-bit unit) × number of 512-bit MLA units × number of cores × frequency in GHz.
Assuming 4 × 512-bit MLA units (doubtful), 8 cores, and 5 GHz: 2 × 64 × 4 × 8 × 5 ≈ 20 TOPS.
I doubt even the 16-core 9950X will reach that.
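Plugging the post's numbers into that formula as a quick sanity check (the 4-unit figure is the post's own doubtful assumption):

```python
def int8_tops(mla_units, cores, ghz, ops_per_mac=2, lanes=64):
    """Peak INT8 throughput: (mul+add) * 64 int8 lanes per 512-bit
    unit * units per core * cores * clock in GHz -> TOPS."""
    return ops_per_mac * lanes * mla_units * cores * ghz / 1000.0

# 4x 512-bit MLA units (doubtful), 8 cores, 5 GHz:
print(int8_tops(4, 8, 5.0))   # → 20.48 TOPS, short of the 40 TOPS bar
# Even 16 cores only just clears it, and only under that 4-unit assumption:
print(int8_tops(4, 16, 5.0))  # → 40.96 TOPS
```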
 

StefanR5R

Elite Member
Dec 10, 2016
5,689
8,260
136
do you think there'll be some 40+ TOPS NPU that can be added via PCIe extension card to fulfill the requirement
That would be an expensive way to obtain the right to put a certain sticker on a low margin product.

PS, curious that MS defines requirements in terms of TOPS and of presence of the Copilot key on the keyboard, but not in terms of RAM size which is not entirely unimportant for local inferencing. (Also, RAM bandwidth?)
 
Last edited:
Reactions: Tlh97 and RnR_au

jamescox

Senior member
Nov 11, 2009
642
1,104
136
I don't have time to read this whole thread. I was wondering if anyone has done or seen a die-size analysis for Zen 5c? It seemed weird that the initial official pictures were all at an angle while the regular Zen 5 got a straight top-down photo, but it looked like they showed a top-down photo in the Computex presentation. It still seems like the die is too long and narrow. I guess it may just be that the Infinity Fabric layout was changed significantly to fit it on the package. Does anything else make sense though? Would an NPU make sense for Zen 5c? I assume there are a lot of servers just doing inferencing, so a CPU with an NPU may make sense for the lowest possible power consumption.
 

Hitman928

Diamond Member
Apr 15, 2012
5,600
8,793
136
I don't have time to read this whole thread. I was wondering if anyone has done or seen a die-size analysis for Zen 5c? It seemed weird that the initial official pictures were all at an angle while the regular Zen 5 got a straight top-down photo, but it looked like they showed a top-down photo in the Computex presentation. It still seems like the die is too long and narrow. I guess it may just be that the Infinity Fabric layout was changed significantly to fit it on the package. Does anything else make sense though? Would an NPU make sense for Zen 5c? I assume there are a lot of servers just doing inferencing, so a CPU with an NPU may make sense for the lowest possible power consumption.

I got ~3.2 mm² with L2 included, IIRC. I can double-check that number tomorrow.
 
Reactions: inf64

jamescox

Senior member
Nov 11, 2009
642
1,104
136
None I guess. But this is the reason I believe AMD is capacity constrained with regards to V-cache. If they had abundant supply, they would have a LOT more SKUs, like maybe so:

4-core V-cache single CCD (Ryzen 3 X3D!)
6-core V-cache single CCD (Ryzen 5 X3D!)
8-core V-cache CCD + 4 core CCD
8-core V-cache CCD + 6 core CCD
4-core V-cache CCD + 4-core V-cache CCD
and who knows how many more!

Please understand that V-cache is a big CACHE die, prone to defects more than logic dies. We have no idea what the V-cache yield rate is and the process of attaching V-cache to the CCD is no simple matter and causes production to slow down enough that they can only do something like 40,000 V-cache CPUs a month (last I heard. Not sure about their latest production figures).
It doesn't make sense to put V-cache on low-end consumer CPU packages; that wouldn't be very profitable. They get the best return on the higher-end parts, although that is probably still low compared to Epyc V-cache parts. The 16-core Epyc X-series is listed at $4,928, the 32-core at $5,529, and the 96-core at almost $14,756. The 4-core-per-CCD dies with V-cache are going into the Epyc 9384X (8 CCDs × 4 active cores; 768 MB of L3).
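The 768 MB figure checks out from the per-CCD numbers (32 MB of base L3 per Zen 4 CCD plus one 64 MB stacked V-cache die):

```python
BASE_L3_MB = 32   # L3 on a Zen 4 CCD
VCACHE_MB = 64    # one stacked V-cache die per CCD
CCDS = 8          # 9384X config: 8 CCDs, 4 active cores each

total_l3 = CCDS * (BASE_L3_MB + VCACHE_MB)
cores = CCDS * 4
print(total_l3)   # → 768 (MB), matching the post's figure
print(cores)      # → 32 cores
```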
 

Aeonsim

Junior Member
May 10, 2020
6
8
81
For potentially radical things they could do with a V-cache die:

1) Double-stack them or increase the size: one CCD with two attached V-cache dies. How does 128 MB of V-cache sound?
2) Use one as a system-level cache attached to an IO die, with or without one or two attached to the cores as L3 cache (lots of modern designs have a system cache)?

In general, guys, I think many of you are undervaluing the marketing value of having the 'fastest' CPU (for gaming) on the planet at any one time. Clearly, AMD makes decent money from the X3D series, as can be determined by the number of different X3D SKUs they offer beyond the server space.

5600X3D
5700X3D
5800X3D
7800X3D
7900X3D
7950X3D

I also believe there was an interview/tour of one of the AMD labs where prototype X3D chips with double X3D dies were mentioned, so they've thought about it before.
 

tsamolotoff

Member
May 19, 2019
62
95
91
potentially radical things
- monolithic die with dual GMI
- x3d stack that is produced with the same tech process as the base die so it won't go poof on high-V
- sanding the dies properly and avoiding using some sort of zero thermal conductivity 'glue' to attach 'structural silicon' (or maybe placing the x3d tile under the actual core die as cache does not really produce any sort of substantial heat output)

But we know AMD loves to penny pinch so no luck here 💲💲
 