It doesn't solve any scheduling issues. If you want something with a large cache, pin it to the same CCD. AMD does this in software now.
It's very marginally faster (frequency-wise) in Zen 5, and no, that's not better, because the large discrepancy in cache size makes those cores not the same, which presents a problem for the scheduler.
I am arguing for two chiplets with 3D cache rated to the same frequency, as close to identical as possible to reduce scheduling issues, plus NUMA mode should be activatable in BIOS so that the OS knows they are different cache domains.
Perhaps; I didn't pay close attention to all this circular discussion of dual V-cache. My own comments were on homogeneous vs. heterogeneous CPUs. And I let myself get pulled into this only because I too easily get irritated by apologism for heterogeneous CPUs.
Sorry, I thought we were discussing why there is no client halo part with 2 x3d chiplets.
vcache on both chiplets does not help with the scheduler issues!
Wrong.
It is not this easy.
Since with Zen5 the freq diff is small enough, the x3d chiplet is usually the better choice.
The problem is that the round-trip from one CCD to another takes forever, not just that only one of the CCDs has vcache. If they both had vcache, you would still take a horrible penalty if game threads got scheduled across the split.
"The problem" which you are stating only exists if threads share large hot data.
I take your word for it, as I last ran computer games myself (and even wrote one) in the 1990s.
This describes all games.
No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B-grade 2nd chiplet). That's what I want and am prepared to pay for it ($1K max).
What you want is games to pin to one CCD
The CCD with the 3D-Vcache runs slower because of the cache, not because of binning. See the 9800X3D clock speeds versus the 9700X clock speeds.
No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B-grade 2nd chiplet). That's what I want and am prepared to pay for it ($1K max).
What's the problem? It's an upsell to AMD easily done.
2 slow CCDs and cost $100 more, think of the value lol
The CCD with the 3D-Vcache runs slower because of the cache, not because of binning. See the 9800X3D clock speeds versus the 9700X clock speeds.
If AMD made a CPU where both CCDs had a cache die, it would have two slow CCDs, not two fast CCDs.
Sure. So, in the cases where one core is maxed out in something, the other has a vcache and can perform those functions quicker if it's sensitive to vcache and slower if it's not. That's all well and good.
No, what I want is 3D cache on both chiplets that are rated to the same frequency (so no B-grade 2nd chiplet). That's what I want and am prepared to pay for it ($1K max).
What's the problem? It's an upsell to AMD easily done.
Obviously, but in Zen 5 the clocks of the 3D version are close enough to the non-3D ones that the difference is immaterial in my view. What's material is uneven chiplets, both in frequency and in cache. This may have been the only way in Zen 4, but now having a non-3D chiplet with 3% faster clocks is a downside, not an upside.
The CCD with the 3D-Vcache runs slower because of the cache, not because of binning
It's not only about clock speed. The V$ die isn't universally faster than the non-V$ die even at the same clock, because a larger cache incurs (well, at least on Zen 4 it did) a 4-cycle penalty for accessing L3. So if your workload fits within 32MB, it would be faster on the non-V$ die even if it clocked the same.
Obviously, but in Zen 5 the clocks of the 3D version are close enough to the non-3D ones that the difference is immaterial in my view. What's material is uneven chiplets, both in frequency and in cache. This may have been the only way in Zen 4, but now having a non-3D chiplet with 3% faster clocks is a downside, not an upside.
Anything that is not even makes scheduling harder
I agree. Alas, it's anyone's guess whether or not AMD is ever going to treat AM5 EPYC (alias EPYC 4000) better than a least-effort Ryzen derivative.
You're not getting uniformity unless AMD gives them an Epyc haircut.
Actually the operating system is well aware of cache topology. (Linux is; I suppose Windows is too.) It just does not apply a cache-aware scheduling policy by default. I suspect this is because the kernel authors do not believe that a generally good enough default policy exists. This is in contrast to non-uniform main memory access (NUMA): Operating systems (Linux at least, I suppose several Windows flavors too) actually do apply a default NUMA-aware scheduling policy. (1. Try to keep a process, including all of its subthreads, running on one NUMA node, such that the process accesses mostly near memory. 2. Spread the overall load from different processes across NUMA nodes. This is a good NUMA related policy in many but not all cases.)
[…] plus NUMA mode should be activatable in BIOS so that the OS knows they are different cache domains.
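For anyone who wants to check what the kernel actually sees, here is a minimal sketch, assuming Linux and its sysfs cache directories under /sys/devices/system/cpu. The function name `l3_domains` and the `sysfs_cpu_root` parameter are made up for illustration. It groups logical CPUs by the L3 cache they share, so on a dual-CCD part one would expect two groups.

```python
from collections import defaultdict
from pathlib import Path


def l3_domains(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Group logical CPUs by shared L3 cache, as reported by Linux sysfs.

    Returns a dict mapping the kernel's shared_cpu_list string
    (for example "0-7,16-23") to a sorted list of logical CPU numbers.
    """
    domains = defaultdict(list)
    for cpu_dir in Path(sysfs_cpu_root).glob("cpu[0-9]*"):
        cpu_id = cpu_dir.name[3:]
        cache_dir = cpu_dir / "cache"
        if not cpu_id.isdigit() or not cache_dir.is_dir():
            continue
        for index_dir in cache_dir.glob("index[0-9]*"):
            level_file = index_dir / "level"
            shared_file = index_dir / "shared_cpu_list"
            if not (level_file.is_file() and shared_file.is_file()):
                continue
            # Only level-3 caches are of interest here.
            if level_file.read_text().strip() != "3":
                continue
            domains[shared_file.read_text().strip()].append(int(cpu_id))
    return {key: sorted(cpus) for key, cpus in domains.items()}


if __name__ == "__main__":
    for shared, cpus in sorted(l3_domains().items()):
        print(f"L3 domain {shared}: {len(cpus)} logical CPUs")
```

On a machine without these sysfs files (non-Linux, or some containers) the function simply returns an empty dict.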
a) He referred to performance determinism, not to extreme peak performance.
If AMD made a CPU where both CCDs had a cache die, it would have two slow CCDs, not two fast CCDs.
a) They are not slow. b) Performance is workload dependent, and so is value.
2 slow CCDs and cost $100 more, think of the value lol
This is wrong if you consider the particular workloads in which 96 MB L3$/CCX actually makes a difference compared with 32 MB L3$/CCX.
The whole discussion about determinism using two more similar CCDs with X3D is very silly, because perfect determinism is already defeated by the fact that the cores on the same die have different max f and f/v behavior.
A dual-CCD CPU which runs twelve or more computationally intense threads does not run at f_max either.
And servers achieve more determinism by generally clocking lower.
you buy a vanilla CPU.
if your workload fits within 32MB,
I think this is a red herring? I think the point here is that, even if the cores were exactly the same, going across to the other CCD will incur a penalty whether it's the same or not. You want all threads on one CCD when you can. If everything was identical, you'd still want to pin the game to one CCD.
It's not only about clock speed. The V$ die isn't universally faster than the non-V$ die even at the same clock, because a larger cache incurs (well, at least on Zen 4 it did) a 4-cycle penalty for accessing L3. So if your workload fits within 32MB, it would be faster on the non-V$ die even if it clocked the same.
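Pinning to one CCD can also be done by hand. A rough sketch, assuming Linux (Python's `os.sched_setaffinity` is not available elsewhere); the helper names `parse_cpu_list` and `pin_to_cpus` are made up for illustration, and the CPU list in the usage note is a placeholder, not the numbering of any particular chip.

```python
import os


def parse_cpu_list(text):
    """Expand a Linux-style CPU list such as '0-7,16-23' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        # A single number like '3' has no upper bound, so reuse the lower one.
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the calling process) to the given CPUs.

    Only CPUs the process is already allowed to use are kept.
    Linux-only: relies on os.sched_getaffinity / os.sched_setaffinity.
    """
    target = set(cpus) & os.sched_getaffinity(pid)
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, target)
    return target
```

Usage would look like `pin_to_cpus(parse_cpu_list("0-7,16-23"))`, where the list is whatever the OS reports for the CCD you want. Child processes started afterwards inherit the affinity mask, so a small launcher script is enough to keep a game on one CCD.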
What generation is the 4004? I don't keep track of all the server parts, but if it is an older generation like Zen 2 or 3 etc., it may be faster in rated frequency but still be using an older architecture, and thus often be slower clock for clock. Also, in actual benchmarks/usage, it may be slower.
Question: why is a 16-core 4004 EPYC (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?
It's Zen 4. The 4564P he's talking about is a 7950X by another name.
What generation is the 4004? I don't keep track of all the server parts, but if it is an older generation like Zen 2 or 3 etc., it may be faster in rated frequency but still be using an older architecture, and thus often be slower clock for clock. Also, in actual benchmarks/usage, it may be slower.
Zen 4 vs. Zen 5? The 7950X also has a base clock of 4.5 GHz.
Question: why is a 16-core 4004 EPYC (4.5 GHz) + motherboard ($399 + $299 = $698 total) close to the price of a 9950X (4.3 GHz, $542), and the 4004 runs faster?
Thanks, I missed that. It says gen 4 in the description of the product. No way I want it then.
It's Zen 4. The 4564P he's talking about is a 7950X by another name.
see above.
Ok, so that helps answer the question... the 4004 is a generation older, though still not much older. You may be able to get it at a bit of a discount though. And where are you seeing it for this price? Prices will vary by seller, and can be especially low for used parts on eBay and similar.
I am afraid we are arguing about 2 different things. All I am saying is that the 9950X3D has nullified the weakness of the 7950X3D by minimising the frequency difference between the vanilla and X3D CCDs, to the point where betting on the X3D CCD as the default one is a set-and-forget solution, as it won't hurt your general performance meaningfully. Something that could not be said about the 7950X3D. Since the OS cannot know whether the apps it's running prefer MHz or MBs of cache until they tell it, the 9950X3D is easier for the scheduler to handle than the 7950X3D in the general case. Adding a second X3D CCD wouldn't make its job meaningfully easier, as the biggest problem with 2 CCDs is that they are 2 CCDs. But yes, some workloads would see better performance from 2x X3D CCDs. It's just orthogonal to the scheduling problem, IMO.
It is not this easy.
It's quite a penalty, both in latency and absolute bandwidth. Current GMI3 links are at 36 GB/s, but looking at AIDA64 results of 9800X3Ds on Google, L3 bandwidth clocks in at over 700 GB/s. So an order of magnitude reduction (and then some) in IO speed just to request from the cache on a different CCD.
I think this is a red herring? I think the point here is that, even if the cores were exactly the same, going across to the other CCD will incur a penalty whether it's the same or not. You want all threads on one CCD when you can. If everything was identical, you'd still want to pin the game to one CCD.
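To put a number on "an order of magnitude (and then some)", here is the arithmetic using only the two figures quoted in the post above (36 GB/s for the link, 700 GB/s as a lower bound for local L3). These are the post's numbers, not measurements of mine, and the function name is just for illustration.

```python
def slowdown_factor(local_gb_per_s, link_gb_per_s):
    """How many times slower the cross-CCD link is than local L3 bandwidth."""
    return local_gb_per_s / link_gb_per_s


# Figures quoted in the thread: >700 GB/s local L3, 36 GB/s GMI3 link.
print(f"{slowdown_factor(700.0, 36.0):.1f}x")  # prints "19.4x"
```

Since the L3 figure was "over 700 GB/s", roughly 19x is a floor for the bandwidth gap, and it says nothing about the latency side of the penalty.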
Seems like cache communication between cores is limited by the laws of physics. So until there's a new law of physics, focus on what is one off from the current market. So much speculation in the thread appears to throw the baby out with the bathwater. There is no magic what-if coming in Zen 5. It's awesome that AMD broke the decorum demanding no big caches for each successive generation. But it's pretty cool how today's innovation has shifted the talk about L1, L2, etc. to whole chips being added onto the package.
It's quite a penalty, both in latency and absolute bandwidth. Current GMI3 links are at 36 GB/s, but looking at AIDA64 results of 9800X3Ds on Google, L3 bandwidth clocks in at over 700 GB/s. So an order of magnitude reduction (and then some) in IO speed just to request from the cache on a different CCD.