Anyway, re: how to multithread games
I'm pretty sure the optimal thing, since there clearly seem to be DX12 and Vulkan issues where they rely on the shared cache and interconnectivity, is to have those running on the other CCX and "only" using those 8 threads for the actual graphics API calls. Also to have the graphics driver running on only one CCX, and somehow on that same one, despite it being a separate application.
This might seem suboptimal, but the other CCX wouldn't sit there doing nothing. You'd still have your main loop, I/O handlers, sound, and so on running on that CCX, and you could be preparing the next draw calls to be sent over to the other CCX.
That, then, would be your only cross-CCX traffic: going from your draw call prep to the actual draw calls. Which should be very little data, since you'd run your physics and transforms on the "multithread" CCX.
But even then... I'm not totally sure.
Like it'd seem you want your heavy single-threaded tasks on one CCX, without using SMT, so you don't want to load anything more onto it.
Then on the other CCX, you want to run the tasks that are cheap to parallelize but that suffer from cross-CCX issues: physics, object transforms, DX12/Vulkan... Pretty much the only single-threaded things there would be the main graphics API thread, your render loop, and your main dispatchers for those heavily parallel tasks, if needed.
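If you were doing that pinning by hand on Windows, it could look something like this. Just a minimal sketch: the masks assume a 1700/1800X-style part where logical processors 0-7 are CCX0 and 8-15 are CCX1, which you'd want to verify with GetLogicalProcessorInformationEx rather than hardcode.

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Assumed topology: LPs 0-7 = CCX0, LPs 8-15 = CCX1 (verify at runtime!).
constexpr DWORD_PTR kCcx0Mask = 0x00FF;
constexpr DWORD_PTR kCcx1Mask = 0xFF00;

void PinCurrentThread(DWORD_PTR mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

int main() {
    // Main loop, I/O, sound, and draw call prep stay on CCX0.
    PinCurrentThread(kCcx0Mask);

    // The one thread that actually issues the graphics API calls,
    // kept on CCX1 next to the parallel physics/transform workers.
    std::thread render([] {
        PinCurrentThread(kCcx1Mask);
        // ... record and submit command lists here ...
    });

    // Parallel worker pool on CCX1, sized to take advantage of SMT.
    std::vector<std::thread> workers;
    for (int i = 0; i < 7; ++i) {
        workers.emplace_back([] {
            PinCurrentThread(kCcx1Mask);
            // ... physics / object-transform jobs ...
        });
    }

    for (auto& w : workers) w.join();
    render.join();
}
```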
But... it's the things going on on your main threads that trigger your object transforms happening.
So on top of that, you need some super tight, optimized way for your inputs and game systems on one side to call for transformations to happen on your actual render side.
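One way that handoff could look, as a rough sketch: a single-producer/single-consumer ring buffer, so the game side on one CCX publishes transform requests and the render side on the other drains them, and the queue itself is the only thing crossing CCXs. TransformCmd and the sizes are made up for illustration.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical payload: "move object N to this position".
struct TransformCmd {
    int objectId;
    float x, y, z;
};

template <std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<TransformCmd, N> buf_;
    // Producer and consumer indices on separate cache lines so the
    // two sides don't false-share on top of the unavoidable traffic.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};

public:
    // Called only from the game-side thread.
    bool push(const TransformCmd& cmd) {
        auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false; // full
        buf_[t & (N - 1)] = cmd;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

    // Called only from the render-side thread.
    std::optional<TransformCmd> pop() {
        auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt; // empty
        TransformCmd cmd = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return cmd;
    }
};
```

Batching a frame's worth of commands per push would cut the cross-CCX cache line ping-ponging down even further.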
In my mind, that's probably the roughly optimal setup.
And this is similar to how many games are set up, as far as I'm aware: 3-8 main threads, but which can spawn a dozen or two (or three) more for their highly parallel tasks, which get stacked on top of the heavier single-threaded tasks.
I think what a lot of people miss is that their traditional view of how multithreading works, which seems intuitive, is that you have, say, a main thread, a render thread, sound, and I/O. They think these are separate systems that you can split up easily.
But... that's not really the case. There's a performance penalty to splitting those up. Clock for clock (or maybe watt for watt is more apt), things run worse split up like this, and they tend to rely on the L3 to make it work a bit better, but still not quite as well.
But then, add to that, people tend to think physics is another thing that might be put on one thread.
But no, physics, on the other hand, is often something that will run more efficiently when split up into many threads and using SMT. Similar to how in Cinebench the 1800X gets a 162 single-threaded score, yet the multicore score is 1624. That's 10 times higher, not the 8 times higher a layman might expect from 8 cores.
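Spelling that scaling out:

```
1624 / 162 ≈ 10.02x   observed multicore scaling
8 cores alone      ->  ~8x
10.02 / 8  ≈ 1.25  ->  SMT adding roughly 25% extra throughput
```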
So this is why I think the optimal layout is that you keep your main thread, and the other things that need to stay "close" to it, on one CCX, and you put the things that benefit from being parallel on the other.
Oh, and the other thing is that you can't benefit from SMT as much if you have something uneven blocking it. But on the other CCX, with almost none of that going on, you're free to write everything in a way that can benefit from SMT.
We get about 45ns or so core to core within a CCX. We get 98ns or so to RAM. 45ns (from within the CCX) + 98ns (from the memory latency) = 143ns, which is about what PC Perspective is getting.
This is what I was going over in my post above yours.
That no, this can't be it, because it'd be 98ns just to write to memory from the pinging core.
But when it comes to reading... say it instantly starts to read. Wouldn't it try to read from memory simultaneously with looking in its L3 cache? It should catch the value a moment after it was just written.
If Ryzen had a latency of 142ns every time it tried to read from memory, that would mean it's checking the L3 cache first and then memory each time, and the true memory latency would be 98-42ns, since the 98ns figure would include the L3 lookup before the actual memory access. Which I don't believe is that fast; I believe they're attempted simultaneously...
I'm explaining this very poorly...
Basically, if the MMU checks L3 before even trying to read memory, that'd mean the real memory latency is 98 - 42 = 56ns.
So the write should have been 56ns, and the read on the other core should have been 98ns instead of 142ns. You're missing 44ns.
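Laying that out with the same numbers:

```
Measured by PC Perspective: cross-CCX read ≈ 142ns

If L3 is checked before memory (serial):
  true memory latency = 98 - 42 = 56ns  (the 98ns would include an L3 miss)
  write from core A  ≈ 56ns
  read on core B     = 42 (L3 check) + 56 (memory) = 98ns, not 142ns

If L3 and memory are probed simultaneously (parallel):
  read on core B     ≈ 98ns, not 142ns

Either way, 142 - 98 = 44ns is unaccounted for.
```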
None of that can be true as far as I can see, but I think I'm still wording this poorly.