[Setting up multiple client instances] is covered by various guides which different people have written on the subject. It's not hard for users with some BOINC experience, but it's also not something which can be rushed.
Note, "some BOINC experience" on Linux in particular implies "some experience with Linux file ownership and access permissions" too. Which is trivial, except if one has only worked with single-user OSs all the time before, or with a multi-user OS which was made by the vendor to look like a single-user OS to the unsuspecting public. (I am of course referring to Windows NT which is made to look very much like DOS/Windows.)
That setting for EPYCs seems like it might be really handy, does it inform the OS to treat each CCX as if it were a separate CPU socket?
The computer in post #156 has got "ACPI SRAT L3 Cache as NUMA Domain" switched on in the BIOS. Hence, the firmware reported 32 NUMA nodes, which coincide with the 32 last-level cache domains which this computer has.
The firmware also still reports 2 sockets to the OS. But the OS's process scheduler cares about NUMA nodes, not about sockets.
Properties of a NUMA node are: which logical CPUs belong to it, which physical memory range(s) are "near" to it (also which block devices, network devices, and PCI devices are near to it), and what relative distance its CPUs have to CPUs of the same and of other NUMA nodes.
numactl --hardware
shows some of these properties.
This relative distance is meant to reflect the latency penalty which an access to "far" memory takes. The firmware reports this distance by means of an artificial weighting factor, not as a real physical measure. But it would for example make sense for all NUMA nodes which are located on the same socket to be reported with a shorter distance to each other than pairs of NUMA nodes which reside on different sockets. (I haven't checked whether the Supermicro AMI BIOS which I am using reports differentiated NUMA distances, or simply the same for all, when "ACPI SRAT L3 Cache as NUMA Domain" is enabled, or when the "NUMA Nodes per Socket (NPS)" option is changed from the default.)
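For illustration, this is roughly the shape of that numactl --hardware output on a hypothetical 2-node machine (the numbers are made up, not taken from my 7452):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64215 MB
node 0 free: 60102 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64454 MB
node 1 free: 61873 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The "node distances" matrix at the bottom is exactly that artificial weighting factor; 10 is the conventional value for a node's distance to itself.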
About the NPS option: The I/O die of EPYC Rome and Milan is internally segmented into four quadrants. Each quadrant is directly attached to 0…2 compute chiplets (depending on the EPYC SKU) and has got one dual-channel DDR4 memory controller. Chiplets have the fastest access to the memory controller which sits on the same quadrant, and a tiny bit slower access to the memory controllers at the other three quadrants. This is the primary reason why AMD offer the NPS option, which defaults to 1 and can be changed to 2 or 4 on EPYC Rome (and, I suppose, on Milan). I am not aware of a report on the inner structure of EPYC Genoa's IOD, nor whether a corresponding NPS option exists in Genoa BIOSes.
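Either way, it is easy to verify from Linux what the firmware actually handed over after changing NPS or the L3-as-NUMA option; for example (generic commands, nothing EPYC-specific):

lscpu | grep -i 'numa node(s)'
ls -d /sys/devices/system/node/node* | wc -l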
────────
So that might be nice and all, but how to make best use of this on a given CPU and a given application, say PSP-LLR? This is something which I hesitate to comment on. The only EPYC which I have is the 7452, i.e. Zen 2, 32c/64t, 8 CCXs per socket, 155…180 W configurable power budget per socket. The further another EPYC model deviates from one or more of these properties, the less the performance characteristics which I measured on the 7452 carry over to it. I could somewhat emulate some of the smaller EPYC Rome SKUs by switching 1, 2, or 3 cores per CCX off in the BIOS, or/and by switching CCDs off, but that's it. 64c/128t or Zen 3 CPUs would be a rather different kettle of fish, let alone Zen 4.
Anyway: top performance is to be had via CPU "pinning", a.k.a. affinity. Reliance on these and other BIOS options alone can get you very close, or only somewhat close, to the optimum, depending on the particular use case. For example, I don't think there is an NPS=8 option for 64c/128t Rome CPUs.
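A minimal sketch of what such pinning can look like, assuming a separate client instance as above and NUMA node numbers as reported by numactl (the data directory is again just a placeholder):

# run one client instance (and all tasks it spawns) on NUMA node 0 only,
# and bind its memory allocations to that node
numactl --cpunodebind=0 --membind=0 boinc --dir /var/lib/boinc-node0 --daemon

# or change the affinity of an already running task to logical CPUs 0-7
taskset -pc 0-7 <PID>

Whether to bind memory strictly (--membind) or merely prefer the local node (--preferred) depends on how much RAM the tasks need per node.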