Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 600
Jul 27, 2020
17,925
11,695
116
Not really, it depends on the application. The data I have been working with has a large chunk for each thread. There will not be any inter-process communication until the work is completed.
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning a thread doesn't need to update the data while working on it?

But yes, if an application's very nature requires inter-thread communication, it will get screwed by high latencies. I don't know what type of multithreading game engines do. I think the type of workload you are doing probably corresponds more to something like CBR23.
 
Jun 1, 2024
44
24
36
Isn't it true that there is a large discrepancy between software development and cpu architectures?

basically non-existent alignment between software and hardware

the only programmers doing low level stuff are for embedded, weak and feeble cpus due to limited resources

every other modern software is written in high level languages, huge abstractions, which translate into a mess in actual IPC

general purpose cpus are literally that, they take a gamble on performance, caches are half blind etc

can you imagine how fast real world perf would be if there was better alignment between software and hardware?
 
Reactions: igor_kavinski

tsamolotoff

Member
May 19, 2019
62
95
91
the only programmers doing low level stuff are for embedded, weak and feeble cpus due to limited resources
No? It's fairly important for HPC as well. Your calculations won't scale unless you write your program to be NUMA-aware and MPI(CH)-aware from the start, with thread overhead kept to a minimum. It's just that game development is currently at the bottom of the barrel: salaries and crunch are not conducive to attracting talent who know what a pointer is and can actually program, instead of doing the proverbial 'match a square hole' thing with shapes made of managed code.
 

JustViewing

Member
Aug 17, 2022
163
276
96
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning a thread doesn't need to update the data while working on it?

But yes, if an application's very nature requires inter-thread communication, it will get screwed by high latencies. I don't know what type of multithreading game engines do. I think the type of workload you are doing probably corresponds more to something like CBR23.
If you are working in low-level languages, you create work groups (a method + a pointer to the data + an event signal) and run them as threads, then wait for all of the work (or any single item) to complete. When a thread finishes a task, it signals completion.
In high-level languages like .NET, it is very easy to do automatic multithreading: you create a list of work items and ask the .NET runtime to run them in parallel. The key here is that there is no communication between the threads or the parent until the work is completed.

In essence, you break the work into multiple blocks and run them in parallel, then combine the results and continue with the process.

Personally, I haven't faced a situation where there needs to be constant communication between threads before the work is completed. Even when communication is required, it is done through signaling. The real issue is that most work cannot be broken up; there is a linear dependency.
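The fork-join pattern described above (partition into independent blocks, run in parallel, combine once all workers signal completion) can be sketched with Python's standard library; the function names and the sum-of-squares workload here are illustrative, not from the post:

```python
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    """Worker: touches only its own slice -- no shared state, so no
    inter-thread communication until the results are combined."""
    return sum(x * x for x in block)

def parallel_sum_squares(data, n_workers=4):
    # Partition the data into contiguous, non-overlapping blocks.
    size = (len(data) + n_workers - 1) // n_workers
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() waits until every worker has finished (the "signal"),
        # then the partial results are combined.
        return sum(pool.map(process_block, blocks))

print(parallel_sum_squares(list(range(1000))))  # → 332833500
```

For CPU-bound pure-Python work a ProcessPoolExecutor would sidestep the GIL, but the structure (fork into independent blocks, join, combine) is the same.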
 

Josh128

Junior Member
Oct 14, 2022
2
8
41
New Strix GB6 runs from 7/1/24. The new 370 run reports a max clock of 5098 MHz, but the .gb6 listing shows nothing higher than ~3.98 GHz. I gave up on trying to pin down IPC in GB; it's going to be somewhere between 580 and 700 points per GHz, lol.

HX 370 (reported)

HX 170 (reported)
 

Joe NYC

Platinum Member
Jun 26, 2021
2,331
2,942
106
This will shed more light on the inter-core communication part: https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/

View attachment 102189
It's not possible to achieve those highlighted requirements without constant inter-thread communication.

Good table there.

In the case of a client CPU with 2 CCDs, how likely is it that demand on the CPU actually needs the 2nd CCD? And when load does spill over one CCD: suppose a process starts on one CCD; then all of its threads (likely to share the same data) could be given affinity to that same CCD. Something like that may be part of the AMD chipset driver and OS scheduler.
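That kind of CCD pinning can already be done by hand on Linux via the scheduler affinity API; a minimal sketch, assuming cores 0-7 sit on one CCD (the numbering is hypothetical and machine-dependent):

```python
import os

def pin_to_ccd(ccd_cpus=range(0, 8)):
    """Pin the current process to one CCD's cores so all of its threads
    share one L3 and avoid the cross-CCD latency penalty.
    Linux-only; which CPU numbers form 'CCD 0' is an assumption here."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform (e.g. macOS, Windows)
    available = os.sched_getaffinity(0)
    target = set(ccd_cpus) & available  # only request CPUs that exist
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

print(pin_to_ccd())
```

Windows exposes the same idea through processor groups and `SetProcessAffinityMask`; AMD's driver plus the OS scheduler would in effect automate this per process.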
 
Jun 1, 2024
44
24
36
If you are working in low level languages

if....

and then the rest you describe about threads/multithreading is still not close enough at all to hardware

TBH multithreading is fairly basic, yet programmers treat it as OMG so advanced so difficult

I'm talking about real optimization of how software works even with L1/L2/L3 caches

for example, does the CPU actually store the right data in the caches to have maximum hit rate?

all this intrinsic CPU functionality is behind a black box, behind the "general purpose" computing package of the modern cpu



P.S. also, what I'm trying to describe is exactly the reason why Nvidia is the "richest" company by market cap in the world right now
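Data layout and traversal order are one part of that "black box" software actually does control. A sketch of row-major versus strided traversal of the same buffer, illustrative only, since Python's interpreter overhead hides most of the cache effect you would see in a compiled language:

```python
from array import array

N = 512  # N x N matrix stored row-major in one flat buffer
mat = array("q", range(N * N))  # 64-bit ints, contiguous in memory

def sum_rows(m, n):
    """Sequential (row-major) walk: consecutive elements share cache
    lines, so the hardware prefetcher keeps the hit rate high."""
    total = 0
    for r in range(n):
        base = r * n
        for c in range(n):
            total += m[base + c]
    return total

def sum_cols(m, n):
    """Strided (column-major) walk over the same data: each access
    jumps n*8 bytes, defeating spatial locality in L1/L2."""
    total = 0
    for c in range(n):
        for r in range(n):
            total += m[r * n + c]
    return total

# Same data, same result; in a compiled language the row-major walk is
# typically several times faster purely from cache-line reuse.
assert sum_rows(mat, N) == sum_cols(mat, N)
```

This is exactly the kind of hardware/software alignment being described: the instructions are identical, only the memory access pattern changes.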
 

Timmah!

Golden Member
Jul 24, 2010
1,463
729
136
I'm not sure if two v-cache CCDs are needed at all.
If we assume they managed to get the V-cache CCD Fmax equal to that of regular CCDs, you'll still end up with inter-CCD penalty. Different CCD Fmax will make things even worse.

P.S. Unless there's a big single L3 chunk glued-TSV'ed atop of both CCDs, with 16 slices forming a common for both CCDs ring (or whatever) bus ? )
I presume if the application can benefit from more than 8 cores, running on 16C despite the inter-CCD penalty is still beneficial, no?
And anyway, even if it's still most beneficial to stick to cores on a single die to avoid that penalty, just the fact that all the cores are equal in this regard and no additional scheduling is required would be worth it imo. Certainly if the price is just another hundred bucks higher.
 

JustViewing

Member
Aug 17, 2022
163
276
96
if....

and then the rest you describe about threads/multithreading is still not close enough at all to hardware

TBH multithreading is fairly basic, yet programmers treat it as OMG so advanced so difficult

I'm talking about real optimization of how software works even with L1/L2/L3 caches

for example, does the CPU actually store the right data in the caches to have maximum hit rate?

all this intrinsic CPU functionality is behind a black box, behind the "general purpose" computing package of the modern cpu



P.S. also, what I'm trying to describe is exactly the reason why Nvidia is the "richest" company by market cap in the world right now
True, you have to treat a single work item as a single-threaded application.
 

tsamolotoff

Member
May 19, 2019
62
95
91
no additional scheduling is required would be worth it imo. Certainly if the price is just another hundred bucks higher.
No, it'd still be required, because the cache is divided into two segments that are treated as a whole by the OS and by applications that try to sync across the whole CPU. For example, even if both your chiplets have extra cache, fps in CS2 will still tank as the game tries to use both CCDs at the same time. Same goes for Warzone and Call of Duty if the worker thread count exceeds the limit of one chiplet. Every other 'scheduling' issue (which is not actually scheduling) can be fixed by setting the 'prefer cache' option in the BIOS or via the registry.
 
Reactions: Timmah!

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,451
136
can you imagine how fast real world perf would be if there was better alignment between software and hardware?

The software that exists would be a much smaller subset of what we have available today. Faster is great, but having "slow" software is better than having no software.

The whole reason we don't aim for alignment with the hardware is that the hardware is constantly evolving and we don't want to lock ourselves in to a single platform or a single vendor on it.

High-level languages were developed so that you could port your programs over to new or different architectures. The software written in x86 assembly and tuned for a specific CPU might be really fast on that one chip, but it might lose performance on the next CPU just because the hardware is built differently in a way that the software clashes with. It's why AMD moved away from VLIW with their GPUs.

We added additional abstraction layers on top of that for similar reasons. As in: I don't care what CPU your Windows PC is running on; I want to develop something that doesn't have to care which version it's running, or even whether it's running Windows at all. I don't want to maintain 30 separate code bases so that my program runs fast on different versions of Windows across multiple architectures from different vendors. No, give me an abstraction layer. The difference is between reasonably fast software and none at all.
 

CakeMonster

Golden Member
Nov 22, 2012
1,428
535
136
New ASUS beta BIOS with AGESA 1.2.0.0 is out:


This is IIRC the third version with 'new CPU' support, so we're getting there...
 

Fjodor2001

Diamond Member
Feb 6, 2010
3,926
404
126
I know some think the Win12 AI PC 40 TOPS requirement is BS on DT.

But just out of curiosity, do you think Zen5 will be able to brute force that requirement using AVX512, or some other generic instruction set? On top-end 9950X only, or even on low-end 9600X?

If not, do you think there'll be some 40+ TOPS NPU that can be added via PCIe extension card to fulfill the requirement (i.e. not using generic discrete GPU)?

Or will AMD give up Win12 AI PC on DT using CPU alone, and hand over that market segment to Intel which is known to fulfill it on Arrow Lake Desktop CPUs?
 

Nothingness

Platinum Member
Jul 3, 2013
2,752
1,402
136
I know some think the Win12 AI PC 40 TOPS requirement is BS on DT.

But just out of curiosity, do you think Zen5 will be able to brute force that requirement using AVX512, or some other generic instruction set? On top-end 9950X only, or even on low-end 9600X?
The math: 2 ops (mul + add) × 64 (int8 lanes per 512-bit unit) × number of 512-bit MLA units × number of cores × frequency in GHz.
Assuming 4 × 512-bit MLA units (doubtful), 8 cores, and 5 GHz: 2 × 64 × 4 × 8 × 5 ≈ 20 TOPS.
I doubt even the 16-core 9950X will reach that.
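Plugging the post's numbers into that formula as a quick sanity check (the 4-unit figure is the post's own doubtful assumption):

```python
def int8_tops(mla_units, cores, ghz, ops_per_mac=2, lanes=64):
    """Peak INT8 throughput: (mul+add) * 64 int8 lanes per 512-bit
    unit * units per core * cores * clock in GHz -> TOPS."""
    return ops_per_mac * lanes * mla_units * cores * ghz / 1000.0

# 4x 512-bit MLA units (doubtful), 8 cores, 5 GHz:
print(int8_tops(4, 8, 5.0))   # → 20.48 TOPS, short of the 40 TOPS bar
# Even 16 cores only just clears it, and only under that 4-unit assumption:
print(int8_tops(4, 16, 5.0))  # → 40.96 TOPS
```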
 

StefanR5R

Elite Member
Dec 10, 2016
5,689
8,260
136
do you think there'll be some 40+ TOPS NPU that can be added via PCIe extension card to fulfill the requirement
That would be an expensive way to obtain the right to put a certain sticker on a low margin product.

PS, curious that MS defines requirements in terms of TOPS and of presence of the Copilot key on the keyboard, but not in terms of RAM size which is not entirely unimportant for local inferencing. (Also, RAM bandwidth?)
 
Last edited:
Reactions: Tlh97 and RnR_au

jamescox

Senior member
Nov 11, 2009
642
1,104
136
I don't have time to read this whole thread. I was wondering if anyone has done or seen a die-size analysis for Zen 5c? It seemed weird that the initial official pictures were all at an angle while the regular Zen 5 got a straight top-down photo, but it looked like they showed a top-down photo in the Computex presentation. It still seems like the die is too long and narrow. I guess it may just be that the Infinity Fabric layout was changed significantly to fit it on the package. Does anything else make sense though? Would an NPU make sense for Zen 5c? I assume there are a lot of servers just doing inferencing, so a CPU with an NPU may make sense for the lowest possible power consumption.
 

Hitman928

Diamond Member
Apr 15, 2012
5,600
8,793
136
I don't have time to read this whole thread. I was wondering if anyone has done or seen a die-size analysis for Zen 5c? It seemed weird that the initial official pictures were all at an angle while the regular Zen 5 got a straight top-down photo, but it looked like they showed a top-down photo in the Computex presentation. It still seems like the die is too long and narrow. I guess it may just be that the Infinity Fabric layout was changed significantly to fit it on the package. Does anything else make sense though? Would an NPU make sense for Zen 5c? I assume there are a lot of servers just doing inferencing, so a CPU with an NPU may make sense for the lowest possible power consumption.

I got ~3.2 mm² with L2 included, IIRC. I can double-check that number tomorrow.
 
Reactions: inf64

jamescox

Senior member
Nov 11, 2009
642
1,104
136
None I guess. But this is the reason I believe AMD is capacity constrained with regards to V-cache. If they had abundant supply, they would have a LOT more SKUs, like maybe so:

4-core V-cache single CCD (Ryzen 3 X3D!)
6-core V-cache single CCD (Ryzen 5 X3D!)
8-core V-cache CCD + 4 core CCD
8-core V-cache CCD + 6 core CCD
4-core V-cache CCD + 4-core V-cache CCD
and who knows how many more!

Please understand that V-cache is a big CACHE die, prone to defects more than logic dies. We have no idea what the V-cache yield rate is and the process of attaching V-cache to the CCD is no simple matter and causes production to slow down enough that they can only do something like 40,000 V-cache CPUs a month (last I heard. Not sure about their latest production figures).
It doesn't make sense to put V-cache on low-end consumer CPU packages; that wouldn't be very profitable. They get the best return on the higher-end parts, although that is probably still low compared to Epyc V-cache parts. The 16-core Epyc X-series is listed at $4,928, the 32-core at $5,529, and the 96-core at almost $14,756. The 4-core-per-CCD dies with V-cache are going into the Epyc 9384X (8 CCDs × 4 active cores; 768 MB of L3).
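The 768 MB figure checks out from the per-CCD numbers (32 MB of base L3 per Zen 4 CCD plus one 64 MB stacked V-cache die):

```python
BASE_L3_MB = 32   # L3 on a Zen 4 CCD
VCACHE_MB = 64    # one stacked V-cache die per CCD
CCDS = 8          # 9384X config: 8 CCDs, 4 active cores each

total_l3 = CCDS * (BASE_L3_MB + VCACHE_MB)
cores = CCDS * 4
print(total_l3)   # → 768 (MB), matching the post's figure
print(cores)      # → 32 cores
```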
 

Aeonsim

Junior Member
May 10, 2020
6
8
81
For potentially radical things they could do with a V-cache die:

1) Double-stack them or increase the size: one CCD with two attached V-cache dies. How does 128 MB of V-cache sound?
2) Use one as a system-level cache attached to an IO die, with or without one or two attached to the cores as L3 cache (lots of modern designs have a system cache)?

In general, guys, I think many of you are undervaluing the marketing value of having the 'fastest' CPU (for gaming) on the planet at any one time. Clearly, AMD makes decent money from the X3D series, as can be determined by the number of different X3D SKUs they offer beyond the server space.

5600X3D
5700X3D
5800X3D
7800X3D
7900X3D
7950X3D

I also believe there was an interview/tour of one of the AMD labs where prototype X3D chips with double X3D dies were mentioned, so they've thought about it before.
 

tsamolotoff

Member
May 19, 2019
62
95
91
potentially radical things
- monolithic die with dual GMI
- x3d stack that is produced with the same tech process as the base die so it won't go poof on high-V
- sanding the dies properly and avoiding using some sort of zero thermal conductivity 'glue' to attach 'structural silicon' (or maybe placing the x3d tile under the actual core die as cache does not really produce any sort of substantial heat output)

But we know AMD loves to penny pinch so no luck here 💲💲
 