I agree. These are two corner cases that probably amount to a low percentage of scenarios, likely low single digits. The most typical case is that a thread mostly uses its own data and, typically, does not jump from CCD to CCD.
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning the thread doesn't need to update the data while it's working on it?
Not really, it depends on the application. The data I have been working on has a large chunk for each thread. There is no inter-process communication until the work is completed.
No? It's fairly important for HPC as well. Your calculations won't scale much unless you write your program to be NUMA-aware and MPI(ch)-aware from the start and keep thread overhead minimal. It's just that game development is currently at the bottom of the barrel, as salaries and crunch are not conducive to attracting the talent who know what a pointer is and can actually program things, instead of doing the proverbial 'match the shape to the square hole' thing with shapes made of managed code, etc.
the only programmers doing low level stuff are for embedded, weak and feeble cpus due to limited resources
If you are working in low-level languages, you need to create work-groups (a method + a pointer to data + an event signal) and make them run as threads. Then you wait for all of the work, or a single item, to complete. When a thread completes a task, it signals to notify the waiter.
How do you partition the data? Just give a pointer to the whole data structure to each thread and make sure they don't do anything out of bounds? Or is your workload read-only, meaning the thread doesn't need to update the data while it's working on it?
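A minimal sketch of the work-group pattern described above (a method + a pointer to data + a completion signal), using standard C++ threads. The names (WorkGroup, process_chunk) and the summing workload are illustrative, not from this thread; each worker only touches its own chunk, and join() plays the role of the "wait for all" signal.

```cpp
#include <thread>
#include <vector>
#include <numeric>
#include <cstddef>
#include <iostream>

struct WorkGroup {
    const double* data;   // pointer to this worker's chunk of the data
    std::size_t   count;  // number of elements in the chunk
    double        result; // written only by the owning thread
};

void process_chunk(WorkGroup& wg) {           // the "method"
    wg.result = std::accumulate(wg.data, wg.data + wg.count, 0.0);
}

int main() {
    std::vector<double> data(1'000'000, 1.0);
    const std::size_t n_threads = 8;
    const std::size_t chunk = data.size() / n_threads;

    std::vector<WorkGroup> groups(n_threads);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n_threads; ++i) {
        groups[i] = { data.data() + i * chunk, chunk, 0.0 };  // per-thread chunk, no sharing
        workers.emplace_back(process_chunk, std::ref(groups[i]));
    }
    for (auto& t : workers) t.join();         // "wait for all work to complete"

    double total = 0.0;                       // the only cross-thread step happens here,
    for (const auto& g : groups) total += g.result;  // after all workers are done
    std::cout << total << '\n';
}
```

This also answers the partitioning question in the simplest case: instead of handing every thread a pointer to the whole structure, each one gets a pointer and a count for its own slice, so nothing needs to be synchronized until the final reduction.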
But yes, if an application's very nature requires inter-thread communication, it will get hit by the high latencies. I don't know what type of multithreading game engines do. I think the type of workload you are doing may correspond more to something like CB R23.
This will shed more light on the inter-core communication part: https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
[Attachment 102189]
It's not possible to achieve those highlighted requirements without constant inter-thread communication.
If you are working in low-level languages
I presume that if the application can benefit from more than 8 cores, running on 16 cores despite the inter-CCD penalty is still beneficial, no?
I'm not sure if two V-cache CCDs are needed at all.
Even if we assume they managed to get the V-cache CCD Fmax equal to that of the regular CCDs, you'll still end up with the inter-CCD penalty. Different Fmax per CCD will make things even worse.
P.S. Unless there's a single big L3 chunk TSV-bonded on top of both CCDs, with 16 slices forming a ring (or whatever) bus common to both CCDs?
True, you have to treat a single work item as a single-threaded application.
if....
and then the rest of what you describe about threads/multithreading is still not at all close to the hardware
TBH multithreading is fairly basic, yet programmers treat it as "OMG, so advanced, so difficult"
I'm talking about real optimization of how software works even with L1/L2/L3 caches
for example, does the CPU actually store the right data in the caches to have maximum hit rate?
all this intrinsic CPU functionality is behind a black box, behind the "general purpose" computing package of the modern cpu
P.S. also, what I'm trying to describe is exactly the reason why Nvidia is the "richest" company by market cap in the world right now
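To make the hit-rate question above concrete, here's a small sketch (the matrix size and function names are illustrative, not from this thread). Both functions sum the same data, but the first walks memory sequentially so the prefetchers and cache lines are used well, while the second strides a full row per access and usually runs noticeably slower because most loads miss in L1/L2.

```cpp
#include <cstdio>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 2048;  // 2048x2048 doubles, ~32 MB, bigger than L3 on most parts

double sum_row_major(const std::vector<double>& m) {
    double s = 0.0;
    for (std::size_t i = 0; i < N; ++i)        // walk each row left to right:
        for (std::size_t j = 0; j < N; ++j)    // consecutive addresses, cache lines reused
            s += m[i * N + j];
    return s;
}

double sum_column_major(const std::vector<double>& m) {
    double s = 0.0;
    for (std::size_t j = 0; j < N; ++j)        // walk each column top to bottom:
        for (std::size_t i = 0; i < N; ++i)    // stride of N*8 bytes per access, mostly misses
            s += m[i * N + j];
    return s;
}

int main() {
    std::vector<double> m(N * N, 1.0);
    std::printf("%f %f\n", sum_row_major(m), sum_column_major(m));
}
```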
No, it'd be required, because the cache is divided into two segments that are treated as a whole by the OS and by applications that try to sync across the whole CPU. For example, even if both your chiplets have the extra cache, fps in CS2 will still tank because the game will try to use both CCDs at the same time. Same goes for Warzone and Call of Duty if the worker thread count exceeds the limit of one chiplet. Every other 'scheduling' (which it is not, actually) issue can be fixed by setting the 'prefer cache' option in the BIOS or via the registry.
no additional scheduling is required would be worth it imo. Certainly if the price is just another hundred bucks higher.
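For what keeping a game on one CCD amounts to in practice, here is a rough sketch that simply pins the current process by hand via the affinity mask (the actual 'prefer cache' / registry mechanism works through the driver and scheduler instead). It assumes logical processors 0-15 belong to the first CCD, which is the usual enumeration on dual-CCD Ryzen parts but should be verified on the actual machine (e.g. with Task Manager or Coreinfo). Win32 only, illustrative rather than a recommendation.

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Assumed mapping: logical processors 0..15 = the first (V-cache) CCD with SMT on.
    const DWORD_PTR ccd0_mask = (1ull << 16) - 1;  // bits 0..15 set

    if (!SetProcessAffinityMask(GetCurrentProcess(), ccd0_mask)) {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("Process pinned to CCD0; worker threads can no longer spill onto CCD1.\n");
    // ... start the latency-sensitive workload here ...
    return 0;
}
```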
I'll try to answer that soon (I hope, but "soon" for me these days ranges anywhere from a day to a week) with my 8-CCD Zen 2 CPU.
In case of a client CPU with 2 CCDs, how likely is it that the demand on the CPU needs the 2nd CCD?
can you imagine how fast real world perf would be if there was better alignment between software and hardware?
Adding the CPU SKUs one at a time, are we, AMD?
This is, IIRC, the third version with 'new CPU' support, so we're getting there...
The math: 2 ops (mul+add) * 64 (int8 lanes) * number of 512-bit MLA units * number of cores * frequency.
I know some think the Win12 AI PC 40 TOPS requirement is BS on desktop.
But just out of curiosity, do you think Zen5 will be able to brute force that requirement using AVX512, or some other generic instruction set? On top-end 9950X only, or even on low-end 9600X?
That would be an expensive way to obtain the right to put a certain sticker on a low-margin product.
do you think there'll be some 40+ TOPS NPU that can be added via a PCIe extension card to fulfill the requirement
Some are doing inferencing with 4-bit floating point. But I'm not sure how that works on AVX-512, if at all.
The math: 2 ops (mul+add) * 64 (int8 lanes) * number of 512-bit MLA units * number of cores * frequency.
Assuming 4x512-bit MLA units (doubtful), 8 cores and 5 GHz: 2*64*4*8*5 = 20,480 Gops/s ≈ 20 TOPS.
I doubt even the 16-core 9950X will reach that.
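For reference, the quoted formula in runnable form, reproducing the poster's own 8-core example. A sketch only: the function name and the divide-by-1000 unit conversion (GHz * ops gives Gops/s) are mine, the numbers are from the post above.

```cpp
#include <cstdio>

// peak int8 TOPS = 2 ops (mul+add) * 64 int8 lanes per 512-bit unit
//                  * number of MLA units * cores * clock in GHz / 1000
double peak_int8_tops(int mla_units, int cores, double ghz) {
    return 2.0 * 64.0 * mla_units * cores * ghz / 1000.0;
}

int main() {
    // The example above: 4x512-bit MLA units, 8 cores, 5 GHz -> ~20 TOPS.
    std::printf("%.1f TOPS\n", peak_int8_tops(4, 8, 5.0));
}
```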
I don't have time to read this whole thread. I was wondering if anyone has done or seen a die size analysis for Zen 5c? It seemed weird that the initial official pictures were all at an angle while the regular Zen 5 got a straight top-down photo, but it looked like they showed a top-down photo in the Computex presentation. It still seems like the die is too long and narrow. I guess it may just be that the Infinity Fabric layout was changed significantly to fit it on the package. Does anything else make sense though? Would an NPU make sense for Zen 5c? I assume there are a lot of servers just doing inferencing, so a CPU with an NPU may make sense for the lowest possible power consumption.
It doesn't make sense to put V-cache on low-end consumer CPU packages. That wouldn't be very profitable. They get the best return on the higher-end parts, although that's probably still low compared to the Epyc V-cache parts. The 16-core Epyc X-series is listed at $4,928, the 32-core at $5,529, and the 96-core at $14,756. The 4-core dies with V-cache are going into the Epyc 9384X (8 CCDs * 4 active cores; 768 MB of L3).
None, I guess. But this is the reason I believe AMD is capacity constrained with regard to V-cache. If they had abundant supply, they would have a LOT more SKUs, maybe like so:
4-core V-cache single CCD (Ryzen 3 X3D!)
6-core V-cache single CCD (Ryzen 5 X3D!)
8-core V-cache CCD + 4-core CCD
8-core V-cache CCD + 6-core CCD
4-core V-cache CCD + 4-core V-cache CCD
and who knows how many more!
Please understand that V-cache is a big CACHE die, more prone to defects than logic dies. We have no idea what the V-cache yield rate is, and the process of attaching V-cache to the CCD is no simple matter; it slows production down enough that they can only do something like 40,000 V-cache CPUs a month (last I heard; not sure about their latest production figures).
- monolithic die with dual GMI
potentially radical things
Yes, TOPS is such an ill-defined term that you never know what it means.
Some are doing inferencing with 4-bit floating point. But not sure how that works on AVX-512, if at all.