The last few years have been particularly exciting in terms of CPU architectures. We have had AMD coming back strongly with their Zen architecture, Intel pushing their Skylake derivatives and recently innovating with their next-gen Cove cores, and even Apple doing exciting things with their own CPU architecture effort. A notable part of the discussion among enthusiasts here and elsewhere has been about inter-core latency comparisons in the latest architectures.
So why "sadness" in the topic title? Well, I am both a CPU architecture enthusiast and a programmer, and I appreciate elegant architectures in both hardware and software. However, the discussion about inter-core latency has always made me picture the messy compromises hardware architects need to make, just to make lousy software run well, rather than optimise architectures for well-written and well-optimised software. In particular, inter-core latency appears quickly as a bottleneck for multi-threaded software that has excessive use of shared memory and locks. Conversely, inter-core latency has little bearing on well-written software that has each thread working independently. (Of course there are cases where even well-written software will be limited by memory sharing, but the solution space with varying degree of sharing seems vast, with common lock-based solutions being far from optimal.)
Although this has appeared obvious to me, with just my basic understanding of hardware and software, I have had little hands-on experience with multi-threaded programming. So I was delighted to have my intuition confirmed by a little experiment I have been playing with lately. The following screenshot and test results are from two versions of the upcoming Folder Size Calculator example in OWLNext, an open-source C++ application framework I am contributing to. While this particular application's speed is limited by the file system, the difference in efficiency between the lockless version and the shared-queue version is substantial. The CPU even downclocks during the lockless test, while it boosts the clock frequency during the shared-queue test.
My guess is that the difference in efficiency is due to the shared-queue solution putting much greater stress on the CPU's cache-coherency mechanism.
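For intuition about the kind of stress I mean, here is a hypothetical micro-illustration (again my own sketch, not the actual Folder Size Calculator code): every increment of a shared atomic forces its cache line to migrate between cores, whereas a thread-local counter stays in one core's cache until a single merge at the end.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Heavy coherency traffic: all threads hammer one cache line.
long long ContendedCount(int threadCount, long long iterations)
{
    std::atomic<long long> shared{0};
    std::vector<std::thread> workers;
    for (int i = 0; i != threadCount; ++i)
        workers.emplace_back([&]
        {
            for (long long n = 0; n != iterations; ++n)
                shared.fetch_add(1, std::memory_order_relaxed); // Line ping-pongs between cores.
        });
    for (auto& w : workers) w.join();
    return shared;
}

// Almost no coherency traffic: each thread counts privately and merges once.
long long LocalCount(int threadCount, long long iterations)
{
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    for (int i = 0; i != threadCount; ++i)
        workers.emplace_back([&]
        {
            long long local = 0; // Lives in a register or one core's cache.
            for (long long n = 0; n != iterations; ++n)
                ++local;
            total += local; // One contended write per thread, at the end.
        });
    for (auto& w : workers) w.join();
    return total;
}
```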
Edit: Sorry, the results are misleading and my conclusion wrong! See below.
So why "sadness" in the topic title? Well, I am both a CPU architecture enthusiast and a programmer, and I appreciate elegant architectures in both hardware and software. However, the discussion about inter-core latency has always made me picture the messy compromises hardware architects need to make, just to make lousy software run well, rather than optimise architectures for well-written and well-optimised software. In particular, inter-core latency appears quickly as a bottleneck for multi-threaded software that has excessive use of shared memory and locks. Conversely, inter-core latency has little bearing on well-written software that has each thread working independently. (Of course there are cases where even well-written software will be limited by memory sharing, but the solution space with varying degree of sharing seems vast, with common lock-based solutions being far from optimal.)
Although this has appeared obvious to me, with just my basic understanding of hardware and software, I have had little hands-on experience with multi-threaded programming. So I was delighted to have my intuition confirmed in a little experiment I have played with lately. The following screenshot and test results are from two versions of the upcoming Folder Size Calculator example in OWLNext, an open-source C++ application framework I am contributing to. While this particular application's speed is limited by the file system, the difference in efficiency between the lockless version and the shared queue version is substantial. The CPU even downclocks in the lockless solution test, while it boosts the clock frequency in the shared queue solution test.
The difference in efficiency is due to the shared queue solution putting much greater stress on the CPU's cache-coherency mechanism, I guess.
Edit: Sorry, the results are misleading and my conclusion wrong! See below.