The last few years have been particularly exciting in terms of CPU architectures. We have had AMD coming back strongly with their Zen architecture, Intel pushing their Skylake derivatives and recently innovating with their next-gen Cove cores, and even Apple doing exciting things with their own CPU architecture effort. A notable part of the discussion among enthusiasts here and elsewhere has been about inter-core latency comparisons in the latest architectures.
So why "sadness" in the topic title? Well, I am both a CPU architecture enthusiast and a programmer, and I appreciate elegant architectures in both hardware and software. However, the discussion about inter-core latency has always made me picture the messy compromises hardware architects need to make, just to make lousy software run well, rather than optimise architectures for well-written and well-optimised software. In particular, inter-core latency appears quickly as a bottleneck for multi-threaded software that has excessive use of shared memory and locks. Conversely, inter-core latency has little bearing on well-written software that has each thread working independently. (Of course there are cases where even well-written software will be limited by memory sharing, but the solution space with varying degree of sharing seems vast, with common lock-based solutions being far from optimal.)
Although this has appeared obvious to me, with just my basic understanding of hardware and software, I have had little hands-on experience with multi-threaded programming. So I was delighted to have my intuition confirmed by a little experiment I have been playing with lately. The following screenshot and test results are from two versions of the upcoming Folder Size Calculator example in OWLNext, an open-source C++ application framework I am contributing to. While this particular application's speed is limited by the file system, the difference in efficiency between the lockless version and the shared-queue version is substantial. The CPU even downclocks during the lockless test, while it boosts the clock frequency during the shared-queue test.
My guess is that the difference in efficiency is due to the shared-queue solution putting much greater stress on the CPU's cache-coherency mechanism.
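For intuition about the kind of stress I mean, here is a hypothetical micro-illustration (again my own sketch, not the actual Folder Size Calculator code): every increment of a shared atomic forces its cache line to migrate between cores, whereas a thread-local counter stays in one core's cache until a single merge at the end.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Heavy coherency traffic: all threads hammer one cache line.
long long ContendedCount(int threadCount, long long iterations)
{
    std::atomic<long long> shared{0};
    std::vector<std::thread> workers;
    for (int i = 0; i != threadCount; ++i)
        workers.emplace_back([&]
        {
            for (long long n = 0; n != iterations; ++n)
                shared.fetch_add(1, std::memory_order_relaxed); // Line ping-pongs between cores.
        });
    for (auto& w : workers) w.join();
    return shared;
}

// Almost no coherency traffic: each thread counts privately and merges once.
long long LocalCount(int threadCount, long long iterations)
{
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    for (int i = 0; i != threadCount; ++i)
        workers.emplace_back([&]
        {
            long long local = 0; // Lives in a register or one core's cache.
            for (long long n = 0; n != iterations; ++n)
                ++local;
            total += local; // One contended write per thread, at the end.
        });
    for (auto& w : workers) w.join();
    return total;
}
```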
Edit: Sorry, the results are misleading and my conclusion wrong! See below.
So why "sadness" in the topic title? Well, I am both a CPU architecture enthusiast and a programmer, and I appreciate elegant architectures in both hardware and software. However, the discussion about inter-core latency has always made me picture the messy compromises hardware architects need to make, just to make lousy software run well, rather than optimise architectures for well-written and well-optimised software. In particular, inter-core latency appears quickly as a bottleneck for multi-threaded software that has excessive use of shared memory and locks. Conversely, inter-core latency has little bearing on well-written software that has each thread working independently. (Of course there are cases where even well-written software will be limited by memory sharing, but the solution space with varying degree of sharing seems vast, with common lock-based solutions being far from optimal.)
Although this has appeared obvious to me, with just my basic understanding of hardware and software, I have had little hands-on experience with multi-threaded programming. So I was delighted to have my intuition confirmed in a little experiment I have played with lately. The following screenshot and test results are from two versions of the upcoming Folder Size Calculator example in OWLNext, an open-source C++ application framework I am contributing to. While this particular application's speed is limited by the file system, the difference in efficiency between the lockless version and the shared queue version is substantial. The CPU even downclocks in the lockless solution test, while it boosts the clock frequency in the shared queue solution test.
The difference in efficiency is due to the shared queue solution putting much greater stress on the CPU's cache-coherency mechanism, I guess.
Edit: Sorry, the results are misleading and my conclusion wrong! See below.