> It just sucks that AMD got tricked into promoting Microsoft's garbage software.

Windows key + G -> "This is a game" keeps the program running on CCD0 (no scheduling issues). You need the updated Game Bar and the AMD Windows driver package, with the Balanced power plan, for it to work properly.
With the supply of V-Cache dies being a limiting factor, using two of them in a single CPU prevents the "birth" of another CPU.
> Just to remind you, the 3D V-Cache products aren't the only ones that leverage SoIC; the MI300 series also relies on it.

We really don't know that to be the case. There have been three years of capacity ramping, and V-Cache sales are still tiny. I really don't think capacity is a problem.
If capacity were still the bottleneck, meaning TSMC's SoIC stacking capacity only matched AMD's V-Cache sales, then TSMC might as well scrap SoIC from its website and give up.
Something TSMC is not doing...
The alternative (and far more plausible) theory is that AMD is currently demand constrained across all of its CPU products, not supply constrained.
Look on eBay:
[attached screenshots of eBay listings]
When V-Cache alone makes for a price difference of more than $2000 between server CPUs with identical core counts, there's no way AMD is just going to give away its V-Cache dies. It would also cut into their server market share, because people and even companies could start using the dual V-Cache CPUs for their commercial workloads instead of investing in a server.
> Just to remind you, the 3D V-Cache products aren't the only ones that leverage SoIC; the MI300 series also relies on it.

IMO, there isn't a dual-X3D consumer CPU because of marketing shenanigans.
> You don't need Game Bar for this; there is https://github.com/cocafe/vcache-tray

It just sucks that AMD got tricked into promoting Microsoft's garbage software.
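Tools like vcache-tray work by restricting a game's CPU affinity so it stays on the V-Cache CCD instead of relying on the Game Bar heuristic. A minimal sketch of the same idea on Linux, using only the standard library; the assumption that logical CPUs 0-7 map to CCD0 is illustrative and platform-specific (check `lscpu` or `/sys/devices/system/cpu` on a real system):

```python
import os

# Assumption: logical CPUs 0-7 are the V-Cache CCD (CCD0). The real mapping
# varies by CPU and BIOS settings; verify it before pinning anything.
CCD0_CPUS = set(range(min(8, os.cpu_count())))

def pin_to_ccd0(pid: int = 0) -> set:
    """Restrict a process (0 = the current process) to the CCD0 CPU set
    and return the affinity mask that is now in effect."""
    os.sched_setaffinity(pid, CCD0_CPUS)
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    print(sorted(pin_to_ccd0()))
```

On Windows the equivalent would go through `SetProcessAffinityMask`; the point is the same: the scheduler can no longer migrate the process off the cache CCD.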
> By the way, is there a reason why an 8-core die with V-Cache + a 6-core die is not possible?

None I guess. But this is the reason I believe AMD is capacity constrained with regard to V-Cache. If they had abundant supply, they would have a LOT more SKUs, maybe like so:
> By the way, is there a reason why an 8-core die with V-Cache + a 6-core die is not possible?

8-core V-Cache + 16-core 5c... in a laptop...
> By the way, is there a reason why an 8-core die with V-Cache + a 6-core die is not possible?

Possible and probable, the two frenemies. Good 8-core dies already have their place in 8-core and 16-core SKUs, while 6-core and 12-core SKUs do a good job catching all kinds of imperfect dies.
> For the CCDs with 7 good cores, they are probably disabling one good core to use them for the 7600/7600X.

The same could probably be said about 7+7 or 5+5.
Like I said:

> For the CCDs with 7 good cores, they are probably disabling one good core to use them for the 7600/7600X.
> None I guess.

AFAIK, with previous Zen CPUs, each CCD must be downcored by the same number of cores.
> I'm not sure if two V-Cache CCDs are needed at all.

So there were some technical reasons behind the decision to make an asymmetrical CPU. If the technical challenges (the clock speed regression) are overcome, then there is no longer any reason to have a complicated asymmetrical CPU.
> (unless there's a big single L3 chunk glued atop both CCDs, with 16 slices forming a common pool for both CCDs' ring buses?)

I recall adroc saying something to the effect that cache coherency issues would come into play with a single shared large L3 across both CCDs. Also, I'm not even sure they can do that single shared L3; there might be issues with that thin V-Cache die spanning the boundary between the two CCDs.
> I'm not sure if two V-Cache CCDs are needed at all.

It's not much different from the 7950X. The number of cases where the 7700X is faster than the 7950X is limited.
Even if we assume they managed to get the V-Cache CCD's Fmax equal to that of regular CCDs, you'll still end up with the inter-CCD penalty (unless there's a big L3 chunk glued atop both CCDs). A different Fmax per CCD would make things even worse.
> I recall adroc saying something to the effect that cache coherency issues would come into play with a single shared large L3 across both CCDs.

If the two CCDs are placed very close together and are restricted to running at the same frequency, idk why there should be any issues. Also, the cores would probably need to be arranged as close as possible to one of the chiplet edges.
> I am puzzled why many reviews analyze core-to-core latencies, as if that were a common usage. The vast majority of communication is core to memory, or core to L3 to memory.

It's because user threads can't execute code without the involvement of a kernel thread. The kernel thread controls when the user thread is allowed to execute, so there's a LOT of inter-thread communication going on, and core-to-core latencies need to be low to reduce that communication overhead.
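That kind of hand-off cost can be made concrete with a small sketch (purely illustrative, not from any review's methodology): two threads pass control back and forth through `threading.Event` objects, so every round trip goes through the kernel's scheduler. The measured per-round-trip time is exactly the sort of inter-thread communication that benefits from low core-to-core latency; absolute numbers are machine-dependent.

```python
import threading
import time

def ping_pong(rounds: int = 1000) -> float:
    """Bounce control between two threads; return seconds per round trip.
    Each round trip involves two kernel-mediated wakeups."""
    ping, pong = threading.Event(), threading.Event()

    def responder():
        for _ in range(rounds):
            ping.wait()   # sleep until the main thread signals
            ping.clear()
            pong.set()    # hand control back

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(rounds):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / rounds

if __name__ == "__main__":
    print(f"{ping_pong() * 1e6:.1f} us per round trip")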
> If the two CCDs are placed very close together and are restricted to running at the same frequency, idk why there should be any issues.

The issue is server chips that have more than 2 CCDs, IMO. All this extra complexity, which only really helps desktop use cases with 2 CCDs, makes it really unlikely AMD would design something like that.
> The most typical case is that a thread mostly uses its own data, and typically it does not jump from CCD to CCD.

That's the problem. It's working on data in a shared memory space, and the different threads have to behave by communicating with each other, so that one thread does not corrupt another thread's data in that shared memory space.
> It's not possible to achieve those highlighted requirements without constant inter-thread communication.

Not really; it depends on the application. The data I have been working with has a large chunk of data for each thread, and there is no inter-thread communication until the work is completed.
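The "each thread owns its own chunk" pattern described above can be sketched like this (a minimal illustration, not anyone's actual workload): each worker touches only its own disjoint slice and its own result slot, so there is no inter-thread traffic, and no locking, until the final combine after `join()`.

```python
import threading

def sum_in_chunks(data, n_threads: int = 4) -> int:
    """Sum a list by splitting it into disjoint per-thread chunks.
    Threads only 'communicate' once, when results are combined."""
    results = [0] * n_threads
    size = (len(data) + n_threads - 1) // n_threads  # ceiling division

    def worker(i: int):
        chunk = data[i * size:(i + 1) * size]  # private slice: no sharing
        results[i] = sum(chunk)                # distinct index: no lock needed

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)  # the only point of cross-thread data flow

if __name__ == "__main__":
    print(sum_in_chunks(list(range(100))))  # prints 4950
```

Whether a real workload looks like this or like the lock-heavy shared-memory case from the previous post is exactly what decides how much core-to-core latency matters.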