This reduces the risk of damage from unforeseen scripting errors:
sudo -u boinc ./llr2_affinity.sh
or more complete:
sudo -u boinc -g boinc ./llr2_affinity.sh
Then the script can't do anything that the pseudo-user boinc isn't allowed to do, such as
FORMAT C:
. :-)
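The llr2_affinity.sh script itself isn't shown in this post. Purely as an illustration (not the actual script), a helper that pins each running LLR2 task to its own CCX using only the lower SMT threads could look roughly like the sketch below; the assumed CPU numbering (0–63 = physical cores, 64–127 = SMT siblings, 8 cores per CCX as on the 9554P) should be verified with lscpu -e first.

```bash
#!/bin/bash
# Illustration only, not the actual llr2_affinity.sh:
# pin each running LLR2 process to its own CCX, lower SMT threads only.
# Assumed topology: CPUs 0-63 are physical cores (8 per CCX), 64-127 are
# their SMT siblings -- check with `lscpu -e` before relying on this.
CORES_PER_CCX=8
i=0
for pid in $(pgrep -f sllr2 | sort -n); do
    first=$(( i * CORES_PER_CCX ))
    last=$(( first + CORES_PER_CCX - 1 ))
    taskset -a -c -p "${first}-${last}" "$pid"   # -a: all threads of the process
    i=$(( i + 1 ))
done
```
- - - - - - - -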
I ran further tests, but with a more recent, bigger workunit. I couldn't find a validated PSP result from a 3840K workunit, so I took a 3456K workunit from the results table of Pavel Atnashev's computer cluster instead. (That's a 27.0 MB cache footprint of FFT coefficients.) It was the WU with the largest credit on that host when I looked about two hours ago. I ran this WU for 20 minutes per test and extrapolated the total duration from the progress made until then.
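The extrapolation itself is just a rule of three. As a sketch, with a made-up progress reading (in practice it comes from BOINC Manager or boinccmd --get_tasks):

```bash
elapsed=1200     # seconds the probe ran (20 minutes)
progress=0.0259  # fraction done after the probe -- hypothetical value
awk -v e="$elapsed" -v p="$progress" \
    'BEGIN { printf "estimated task duration: %.0f s\n", e / p }'
```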
workunit: 222113*2^34206293+1 for 82,165.65 credits
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821
hardware: EPYC 9554P (Zen 4 Genoa 64c/128t), cTDP = PPT = 400 W, 12 channels of DDR5-4800
test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency |
---|---|---|---|---|---|---|---|
8×8 | none (random scheduling by Linux) | 35:49:20 (128960 s) | 5.4 | 0.440 M | 3.60 GHz | 370 W | 1.19 kPPD/W |
8×8 | 1 task : 1 CCX, only lower SMT threads | 12:52:37 (46357 s) | 14.9 | 1.225 M | 3.34 GHz | 485 W | 2.53 kPPD/W |
8×16 | 1 task : 1 CCX, all SMT threads | 13:02:32 (46952 s) | 14.7 | 1.210 M | 3.05 GHz | 500 W | 2.42 kPPD/W |
4×16 | 1 task : 2 CCXs, only lower SMT threads | 8:35:14 (30914 s) | 11.1 | 0.919 M | 3.60 GHz | 480 W | 1.91 kPPD/W |
4×32 | 1 task : 2 CCXs, all SMT threads | 8:39:42 (31182 s) | 11.0 | 0.911 M | 3.18 GHz | 490 W | 1.86 kPPD/W |
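The derived columns follow from the measured per-task duration, the credit per workunit, and the wall power, with the number of concurrent tasks given by the first number of the "test" column. A quick check for the pinned 8×8 row:

```bash
awk 'BEGIN {
    tasks  = 8          # concurrent tasks (8x8 test)
    dur    = 46357      # extrapolated seconds per task
    credit = 82165.65   # credits per workunit
    watts  = 485        # host power at the wall
    tpd = tasks * 86400 / dur     # tasks/day   -> ~14.9
    ppd = tpd * credit            # points/day  -> ~1.225 M
    eff = ppd / watts / 1000      # efficiency  -> ~2.53 kPPD/W
    printf "tasks/day %.1f, PPD %.3f M, %.2f kPPD/W\n", tpd, ppd / 1e6, eff
}'
```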
workunit: 225931*2^34726136+1 for 91,933.52 credits,
on Zen 4: all-complex AVX-512 FFT length 3600K (28.125 MBytes),
on Zen 2 and Broadwell-EP: zero-padded FMA3 FFT length 3840K (30.0 MBytes)
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821, 15 minutes/test
hardware: dual EPYC 7452 (Zen 2 Rome, 2× 32c/64t, 2× 8×16 MB L3$), cTDP = PPT = 180 W, 2×8 channels of DDR4-3200
test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency |
---|---|---|---|---|---|---|---|
8×8 | none (random scheduling by Linux) | 53:36:35 (192995 s) | 3.6 | 0.329 M | 3.11 GHz | 460 W | 0.72 kPPD/W |
8×8 | 1 task : 2 CCXs, only lower SMT threads | 32:23:52 (116632 s) | 5.93 | 0.545 M | 2.87 GHz | 480 W | 1.14 kPPD/W |
8×16 | 1 task : 2 CCXs, all SMT threads | 32:59:08 (118748 s) | 5.82 | 0.535 M | 2.75 GHz | 495 W | 1.08 kPPD/W |
4×16 | 1 task : 4 CCXs, only lower SMT threads | 17:02:29 (61349 s) | 5.63 | 0.518 M | 2.94 GHz | 460 W | 1.13 kPPD/W |
4×32 | 1 task : 4 CCXs, all SMT threads | 17:12:43 (61963 s) | 5.58 | 0.513 M | 2.83 GHz | 475 W | 1.08 kPPD/W |
A note on the "affinity = none" test: The outcome there is sensitive to the random nature of thread scheduling. If I re-ran this test, or ran it for much longer, I might get notably higher or lower results.
- - - - - - - -
hardware: dual Xeon E5-2696 v4 (Broadwell-EP, 2× 22c/44t, 2× 55 MB L3$), unlimited all-core turbo duration, 2×4 channels of DDR4-2133
test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency |
---|---|---|---|---|---|---|---|
4×11 | 2 tasks : 1 socket, only lower SMT threads | 28:04:01 (101041 s) | 3.42 | 0.314 M | 1.95 GHz | 475 W | 0.66 kPPD/W |
2×22 | 1 task : 1 socket, only lower SMT threads | 12:58:18 (46698 s) | 3.70 | 0.340 M | 1.95 GHz | 440 W | 0.77 kPPD/W |
There is considerable overhead here: memory accesses hurt especially in the 4×11 test (two 30.0 MByte FFTs per socket exceed the 55 MB L3$), and inter-thread synchronization especially in the 2×22 test. Because of this overhead, the host does not maintain its all-core turbo clock, which is 2.6 GHz in AVX2 workloads. In PrimeGrid subprojects with considerably smaller workunits, 2.6 GHz can be maintained, which pushes host power consumption to well over 500 W.
Still, LLR2's performance scaling up to 22 threads per task is quite good, much better than that of many other multithreaded programs, at least on this kind of host hardware with its big unified, inclusive CPU cache.
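For pinnings like "2 tasks : 1 socket" or "1 task : 1 socket, only lower SMT threads", the mapping of logical CPUs to sockets, cores, and SMT siblings has to be known first, and it differs between machines and kernels. Two standard tools show it (nothing specific to this host):

```bash
lscpu -e=CPU,NODE,SOCKET,CORE   # which logical CPU sits on which socket/core
numactl --hardware              # NUMA nodes with their CPUs and memory
```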
- - - - - - - -
The workunit from the above tests completed after 15.6 hours on the 9554P:
hardware: EPYC 9554P (Zen 4 Genoa 64c/128t, 8×32 MB L3$), cTDP = PPT = 400 W, 12 channels of DDR5-4800
running 1 task of the 225931*2^34726136+1 workunit along with 7 other random llrPSP tasks
setup | affinity | actual duration | tasks/day¹ | PPD¹ | avg. core clock | host power | power efficiency¹ |
---|---|---|---|---|---|---|---|
8×8 | 1 task : 1 CCX, only lower SMT threads | 15:36:16 (56176 s) | 12.3 | 1.131 M | 3.57 GHz | 475 W | 2.38 kPPD/W |
The first eight tasks which this host completed after the start of the challenge had AVX-512 FFT lengths of 3600K (6×) and 3456K (2×), respectively.
The six 3600K (28.125 MBytes) units took ≈56,300 s (15.6 h)² and gave 1.13 MPPD (2.38 kPPD/W)³ on average.
The two 3456K (27.0 MBytes) units took ≈43,600 s (12.1 h)² and gave 1.32 MPPD (≈2.75 kPPD/W)³ on average.
¹) if all tasks had the same performance
²) per task
³) per host, if it ran only this type of tasks, 8 at once
That is, my earlier short 20-minute test with a 3456K unit underestimated the actual performance for that type of workunit, and it is not representative of the host's performance with the slightly bigger type of workunit. (PrimeGrid estimates the credit for each workunit a priori, based on the expected computational workload, with the goal of keeping PPD constant within a subproject. But no estimation is perfect. The -14% drop in PPD from 3456K units to 3600K units is a bit surprising to me.) Still, between these two workunit types, the relative performance of the various threadcount/affinity combinations which I tested should be very similar on the Zen computers with their 16 or 32 MByte L3$ segments.
In contrast, for the tests on the Broadwell-EP with its 55 MB L3$, it made sense to me to wait for a workunit with a 30.0 MByte FMA3 FFT to show up.
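For what it's worth, the -14% figure follows directly from the per-host averages quoted above:

```bash
awk 'BEGIN {
    ppd_3456 = 1.32e6   # average PPD with 3456K units (see above)
    ppd_3600 = 1.13e6   # average PPD with 3600K units (see above)
    printf "PPD change: %.1f %%\n", (ppd_3600 / ppd_3456 - 1) * 100   # ~ -14.4 %
}'
```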