I added dual-32-core EPYC results to post
#92.
This server is like a cluster of 8 Ryzen 3700Xs, each one configured to an anemic 39 W PPT.
The throughput optimum on this computer is reached with single-threaded tasks, number of concurrent tasks = core count (that is, SMT not used). In contrast, @biodoc's 3700X @ 65 W (post #100) gets its best throughput with dual-threaded tasks, number of concurrent tasks = core count (that is, SMT used). The difference between the two machines probably comes from the EPYC's frugal per-core power budget.
Edit, maybe it's also a firmware thing. But if there is a relevant difference in the firmware, then it ultimately has to be connected with the different power budgets per core on EPYC and Ryzen.
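For anyone who wants to follow the bookkeeping behind "SMT used / not used": here is a minimal sketch (in Python; the function name and layout are mine, only the core counts come from the posts above) of how a tasks × threads configuration maps onto hardware threads.

```python
# Hedged sketch (names are mine): given a machine and a run configuration,
# how many hardware threads get occupied, and does the configuration rely on SMT?
def occupancy(cores, smt, tasks, threads_per_task):
    hw_threads = cores * (2 if smt else 1)
    used = tasks * threads_per_task
    return used, hw_threads, used > cores   # third value: "SMT used"

# dual-32-core EPYC, throughput optimum: 64 tasks x 1 thread
print(occupancy(cores=64, smt=True, tasks=64, threads_per_task=1))   # (64, 128, False)
# biodoc's 3700X, throughput optimum: 8 tasks x 2 threads
print(occupancy(cores=8, smt=True, tasks=8, threads_per_task=2))     # (16, 16, True)
# EPYC efficiency optimum: 16 tasks x 8 threads (one task per CCX)
print(occupancy(cores=64, smt=True, tasks=16, threads_per_task=8))   # (128, 128, True)
```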
To my surprise, the EPYC achieves its best power efficiency (and at the same time near-optimum throughput) with 8-threaded tasks and SMT in use, that is, when one task uses as many hardware threads as one CCX offers. Given this high throughput, we can reasonably assume that the Linux kernel ran this workload with a 1:1 mapping of tasks to CCXs.
Edit, but this alone does not explain why this worked so much better than 4-threaded tasks.
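For illustration only (this is not what I actually ran; the kernel scheduler arrived at the CCX mapping on its own): a Python sketch of what an explicit 1:1 pinning of 8-threaded tasks to CCXs would look like on Linux. The LLR2 command line is a placeholder, and the CPU numbering assumes the common layout where the SMT siblings of cores 0..63 show up as CPUs 64..127; check thread_siblings_list in sysfs before trusting it.

```python
# Sketch, not the actual setup: launch 16 tasks of 8 threads each and pin every
# task to the 8 hardware threads of "its" CCX, mimicking the 1:1 mapping that
# the kernel apparently chose by itself here.
import os
import subprocess

CORES = 64          # physical cores of the dual-32-core EPYC
CORES_PER_CCX = 4   # Zen 2: 4 cores / 8 hardware threads per CCX
TASK_CMD = ["./llr2", "-t", "8", "workunit.in"]   # placeholder command line

procs = []
for ccx in range(CORES // CORES_PER_CCX):   # 16 CCXs in total
    first = ccx * CORES_PER_CCX
    cpus = set(range(first, first + CORES_PER_CCX))   # the CCX's physical cores
    cpus |= {c + CORES for c in cpus}                 # plus their SMT siblings
    # set the affinity in the child before exec, so all worker threads inherit it
    p = subprocess.Popen(TASK_CMD,
                         preexec_fn=lambda c=cpus: os.sched_setaffinity(0, c))
    procs.append(p)

for p in procs:
    p.wait()
```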
I also tested two configurations which were expected to give bad results, but I was curious to learn how bad:
- With 128 single-threaded tasks, the processor caches are no longer sufficient for this workload. The result is that throughput drops to 63 % and power efficiency to 58 % of their respective optima.
- With 16-threaded tasks, one task has to be spread across 2 CCXs. But while core-to-core communication within a CCX has extremely low latency, communication across CCXs performs about as badly as communication via main memory. Apparently the inter-thread communication in LLR2 is considerable, and therefore throughput drops to 27 % and power efficiency to 32 % of their respective optima. (The sketch after this list shows how to find the CCX boundaries on a given machine.)
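If you want to see where the CCX boundaries are on your own machine, and therefore where a 16-threaded task is forced to communicate across CCXs, grouping hardware threads by the L3 they share does the trick on Zen 2, since each CCX has a private L3. A small sketch, assuming the usual Linux sysfs layout and that index3 is the L3:

```python
# Sketch: group hardware threads by the L3 they share; on Zen 2, each CCX has
# its own L3, so each group corresponds to one CCX. Assumes index3 is the L3
# in this machine's sysfs (worth verifying via cache/index3/level).
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for cpu_dir in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
    shared = (cpu_dir / "cache/index3/shared_cpu_list").read_text().strip()
    groups[shared].append(cpu_dir.name)

for l3_cpus, members in sorted(groups.items()):
    print(f"CCX (L3 shared by {l3_cpus}): {len(members)} hardware threads")
```

On the dual-32-core EPYC this should report 16 groups of 8 hardware threads each, so an 8-threaded task fits exactly into one group, while a 16-threaded task necessarily straddles two.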