I took measurements on a dual-7452. They should be representative of a single-7452 as well. (The dual-7452 spends part of its energy budget on the Infinity Fabric link between the two sockets, but that should not amount to much because inter-socket traffic is low in this sort of workload. Also, I increased PPT and TDP in the BIOS to the 7452's maximum of 180 W; the default is 155 W.)
I performed the measurements with a fixed workunit in a scripted testbed. That is, the workloads consisted of multiple tasks from the same workunit running in parallel, with the same workunit re-used in all test scenarios across different parallel task counts and thread counts. As a consequence, these tests are very precise, repeatable, and quick.
In contrast, observations of random workunits coming from PrimeGrid are not as conclusive, especially because there are two types of "main tasks" which differ not only in duration but also in PPD, as I mentioned in
#153.
The particular tests I ran and their results are posted in a private section of the
teamanandtech.org forum. However, I already spilled the beans in
#153.
A single 7452 has 8 CCXs, each CCX consisting of 4c/8t and 16 MB L3$. Here is how I am configuring them for the challenge:
- In the BIOS, "Advanced" --> "ACPI Settings", I am switching "ACPI SRAT L3 Cache As NUMA Domain" from "Auto" to "Enabled". (That's how it is labeled in Supermicro's AMI BIOS.)
The effect of this is that the firmware presents each CCX as a NUMA node to the operating system. A NUMA-aware OS like Linux will then attempt to keep multithreaded processes within a NUMA node. Declaring a CCX a NUMA node is a bit of a hack, standing in for cache-aware scheduling. The latter is more problematic than NUMA-aware scheduling, and I am not sure whether cache topology plays a role at all in current Linux scheduler decisions. (There is a related scheduler change in Linux 5.16 which I
mentioned elsewhere, but this shouldn't affect all-core loads.)
In the output of the "lscpu -e" command, the NODE column shows the effect of this change: a dual-7452 system will then report nodes 0…15 instead of nodes 0…1 (see the example output after this list).
- For the first 4+ days of the 5 challenge days, I will run 8 tasks in parallel on each 7452 (that is, 16 tasks at once on a dual-7452 computer), i.e. a 1:1 ratio of tasks to CCXs.
- I choose to give each task 4 program threads (see the app_config.xml sketch after this list).
- However, the CPU could of course give 8 hardware threads to each task, since each CCX has 8 threads. And indeed, using all threads would increase performance by a small percentage, but it would increase power draw more than proportionally.
- So, since EPYC Rome isn't a power hog in the first place, you might prefer to go with 8 threads per task and spend that little bit of extra electric energy.
- During the last day, if I find the time, I may switch to fewer tasks at once combined with more program threads per task. This will sacrifice throughput but decrease run times. That way, the last hours of the challenge will be better filled out.
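For illustration, here is roughly what the node layout looks like once the BIOS option is enabled. This is an abridged, hypothetical excerpt; the exact CPU numbering and the placement of the SMT siblings depend on how the kernel enumerates the CPUs:

    # Abridged, hypothetical example output; exact CPU numbering may differ.
    $ lscpu -e=CPU,NODE,SOCKET,CORE
    CPU NODE SOCKET CORE
      0    0      0    0
      1    0      0    1
      2    0      0    2
      3    0      0    3
      4    1      0    4
    ...
     31    7      0   31
     32    8      1   32
    ...
     63   15      1   63
     64    0      0    0
    ...
    127   15      1   63

Each of the 16 nodes then corresponds to one CCX with 4c/8t and its own 16 MB L3 slice, which is exactly the granularity within which the scheduler should keep a 4-thread task.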
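For reference, the task count and the threads per task can be set through BOINC's app_config.xml in the PrimeGrid project directory. Below is a minimal sketch matching the plan above (16 concurrent tasks on the dual-7452, 4 LLR threads each). The app name "llrPSP" is only a placeholder (substitute whichever subproject the challenge actually runs), and the path assumes a Debian/Ubuntu boinc-client installation; adjust both to your setup:

    # Sketch only: "llrPSP" is a placeholder app name, and the path below is the
    # Debian/Ubuntu default BOINC data directory.
    $ cat /var/lib/boinc-client/projects/www.primegrid.com/app_config.xml
    <app_config>
      <app>
        <name>llrPSP</name>
        <max_concurrent>16</max_concurrent>
      </app>
      <app_version>
        <app_name>llrPSP</app_name>
        <cmdline>-t 4</cmdline>
        <avg_ncpus>4</avg_ncpus>
      </app_version>
    </app_config>

max_concurrent caps the number of simultaneous tasks, "-t" passes the thread count to LLR, and avg_ncpus tells the BOINC scheduler how many CPUs each task occupies. After editing the file, have the client re-read the config files (BOINC Manager: "Options" --> "Read config files", or "boinccmd --read_cc_config").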
________
Edit: There is one thing, though, which I haven't considered at all yet: whether
@cellarnoise is to be classified as impish or as admirable.