Unfortunately, my attempts at writing a micro benchmark have produced wildly inconsistent results, so I don't feel comfortable posting them until I figure out what's going on, and I don't have any more time for that today. (It can be very difficult to test the right thing with micro benchmarks.)
Sometime next week I plan to go through it in more depth and read an assembly listing to see what MSVC is actually producing.
Anyway, I'm interested in exploring the performance penalties of 'False Sharing' on Ryzen.
In effect, 'False Sharing' is what occurs when a thread writes to data located on the same cache line as data another thread is trying to access. As I understand it, only one core can hold a cache line in a writable state at any given time, so this creates a dependency between threads that is not obvious to the programmer. It's a serialization of resource access where the entire cache line is bounced back and forth between cores, even though the threads are operating on completely different memory locations (which merely happen to sit on the same cache line) and are otherwise embarrassingly parallel.
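To make that concrete, here's a minimal sketch of the pattern (the struct, names, and iteration count are my own illustration, not taken from any of the sources below): two threads that never touch the same variable, yet still fight over one 64-byte line.

```cpp
#include <thread>

// Both counters live on the same 64-byte cache line (alignas(64)
// guarantees the struct starts on a line boundary), so every
// increment by one thread invalidates the other core's copy of the
// line, even though the threads never touch the same variable.
struct alignas(64) Counters {
    volatile long long a = 0;  // written only by t1
    volatile long long b = 0;  // written only by t2
};

int main() {
    Counters c;
    std::thread t1([&] { for (long long i = 0; i < 100000000LL; ++i) ++c.a; });
    std::thread t2([&] { for (long long i = 0; i < 100000000LL; ++i) ++c.b; });
    t1.join();
    t2.join();
}
```

The volatile is only there to stop the optimizer from collapsing the loops; a serious test would also pin the threads to specific cores and compare against a padded layout.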
So, what are the effects of false sharing on performance?
1. CPU utilization will appear high, possibly even maxed out, while a core is in fact just waiting for exclusive access to a cache line (whenever it needs to perform a write).
2. It's very difficult to actually detect false sharing unless you have access to hardware performance counters (e.g. via Linux's perf c2c or Intel VTune) - most profilers will simply show you hot spots in your code.
In blue, all threads are on separate cache lines. In red, all threads share the same cache line, even when they're not modifying or reading from the same data. The performance can actually get worse (scaling with thread count) than just using a single thread in this kind of extreme scenario.
Of course, this graph shows an extreme scenario of False Sharing; in the real world you might just see poor thread scaling, since threads won't constantly be accessing the same cache line, nor constantly writing. A rough sketch of this kind of micro benchmark follows the source link below.
Source:
https://mechanical-sympathy.blogspot.ca/2011/07/false-sharing.html
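For what it's worth, here is the rough shape of the blue-vs-red experiment as I understand it. This is my own sketch, not the code behind the graph above; the thread count, iteration count, and use of volatile are arbitrary choices, and as noted at the top, getting stable numbers out of something like this is the hard part.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr long long kIters = 100000000LL;

// "Red" case: 8-byte counters packed back to back, so several of
// them end up on the same 64-byte cache line.
struct Packed { volatile long long v = 0; };

// "Blue" case: each counter padded out to its own 64-byte line.
// (Needs C++17 for over-aligned allocation inside std::vector.)
struct alignas(64) Padded { volatile long long v = 0; };

template <typename Slot>
double run() {
    std::vector<Slot> slots(kThreads);
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&slots, t] {
            for (long long i = 0; i < kIters; ++i) ++slots[t].v;
        });
    for (auto& th : threads) th.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same cache line:      %.2fs\n", run<Packed>());
    std::printf("separate cache lines: %.2fs\n", run<Padded>());
}
```

On Ryzen specifically, you'd also want to control which CCX each thread lands on (e.g. with affinity masks) to separate the intra-CCX cost from the inter-CCX cost.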
Here's a diagram of a cache line being accessed by two different threads. As you can clearly see, the threads are operating on different pieces of data.
On modern x86 CPUs, a cache line is 64 bytes (that has nothing to do with 64-bit vs 32-bit mode, for anyone tempted to make that connection). Thus, you could have an array of 16 integers aligned to the start of a cache line, and if one element is being modified by a thread while a neighboring element is being read by another thread, the reading thread will have to wait. The usual padding fix is sketched after the source link below.
Source:
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
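The usual mitigation is to pad or align per-thread data so that no two threads' hot variables share a line. A sketch, assuming C++11 alignas is available:

```cpp
// 16 ints at 4 bytes each fill exactly one 64-byte line, so two
// threads writing to neighboring elements will contend for it.
int counters_shared[16];

// Giving each counter a whole line to itself removes the contention,
// at the cost of 16x the memory for this array.
struct alignas(64) PaddedInt {
    int value;
};
PaddedInt counters_padded[16];

static_assert(sizeof(PaddedInt) == 64, "one counter per cache line");
```

The trade-off is memory: per-line padding only makes sense for a small number of heavily written, per-thread variables.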
Here's an example of how false sharing can look on a CPU utilization graph, and how deceptive it can be:
Source:
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=2
Below is a single thread doing the same amount of work, but faster, and with significantly lower overall CPU utilization (the total amount of work appears to be much smaller, but it's actually the same!):
Source:
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=2
So, what I'm interested in finding out is:
1. How prevalent is false sharing in game engines? Even the software I tested on Page 9 (http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/ryzen-strictly-technical.2500572/page-9#post-38776310) appears to exhibit this behavior when I push it through a profiler (a huge number of hot spots on a few load instructions). It also fails to scale beyond ~8 threads, with very little benefit beyond 4.
2. Are the serializing effects of false sharing (which essentially grow the 'serialized' portion in Amdahl's law as thread count rises, reducing scaling the more false sharing occurs) exacerbated by NUMA and NUMA-like topologies (as in Ryzen's case)? Could the increased cache coherency traffic saturate the bandwidth of QPI or, in Ryzen's case, the inter-CCX fabric? Is there a higher latency penalty for a cache line bouncing between CCXs than within a single CCX?
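To put a rough number on that Amdahl's law point (the 5% figure is purely an illustrative assumption, not a measurement):

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
% If false sharing effectively serializes 5% of the work (p = 0.95):
%   S(8)      = 1 / (0.05 + 0.95/8) \approx 5.9   (vs. an ideal 8x)
%   S(\infty) = 1 / 0.05 = 20x
```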
Unfortunately I can't answer these questions myself, as I don't think I have the knowledge to give a clear answer, but I'm hoping someone like @Dresdenboy will be able to chime in.