321-LLR, "321 (LLR) 8.00" application
(name for app_config.xml: llr321)
I tested dual-processor Linux hosts, with two different configurations per processor type.
Config 1:
4 threads/task, use 50 % of hardware threads
(Hyperthreading is on but not used)
Config 2:
4 tasks/socket, use 100 % of hardware threads
(Hyperthreading is on and used)
The latter config gives slightly better throughput: +13 % PPD on the E5-2690v4 and +3 % on the E5-2696v4.
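As a rough illustration of how the two configs translate into task and thread counts, here is a minimal Python sketch. The function names are made up for this example, and it assumes two hardware threads per core (Hyperthreading on), as on the hosts in this post.

```python
def config_half_threads(sockets, cores_per_socket, threads_per_task=4):
    """Config 1: fixed threads per task, occupying 50 % of the hardware threads."""
    hw_threads = sockets * cores_per_socket * 2  # assumes 2 hardware threads per core
    tasks = (hw_threads // 2) // threads_per_task
    return tasks, threads_per_task


def config_all_threads(sockets, cores_per_socket, tasks_per_socket=4):
    """Config 2: fixed tasks per socket, occupying 100 % of the hardware threads."""
    hw_threads = sockets * cores_per_socket * 2  # assumes 2 hardware threads per core
    tasks = sockets * tasks_per_socket
    return tasks, hw_threads // tasks


# (tasks at a time, threads per task) for the two dual-socket hosts
for name, cores in [("E5-2690v4", 14), ("E5-2696v4", 22)]:
    print(name, config_half_threads(2, cores), config_all_threads(2, cores))
```

For the 14-core and 22-core processors this reproduces the 7 x 4 / 8 x 7 and 11 x 4 / 8 x 11 layouts used in the measurements.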
dual E5-2690v4 (2x 14C/28T = 56T) @ 2.9 GHz,
7 tasks at a time x 4 threads/task:
184,000 PPD
35 minutes mean time between task completions (4h06m average task duration)
8 tasks at a time x 7 threads/task:
208,000 PPD (+13 %)
31 minutes mean time between task completions (4h08m average task duration)
dual E5-2696v4 (2x 22C/44T = 88T) @ 2.6 GHz,
11 tasks at a time x 4 threads/task:
262,000 PPD
25 minutes mean time between task completions (4h32m average task duration)
8 tasks at a time x 11 threads/task:
270,000 PPD (+3 %)
24 minutes mean time between task completions (3h12m average task duration)
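The "mean time between task completions" figures follow from the average task duration divided by the number of tasks running at a time. A small Python sketch of that arithmetic, using only the numbers listed above (the function name is just for illustration):

```python
def minutes_between_completions(hours, minutes, concurrent_tasks):
    """Average task duration divided by the number of tasks running at a time."""
    return (hours * 60 + minutes) / concurrent_tasks


# (host and config, duration hours, duration minutes, tasks at a time)
results = [
    ("E5-2690v4, 7 x 4", 4, 6, 7),
    ("E5-2690v4, 8 x 7", 4, 8, 8),
    ("E5-2696v4, 11 x 4", 4, 32, 11),
    ("E5-2696v4, 8 x 11", 3, 12, 8),
]

for label, h, m, tasks in results:
    print(f"{label}: {minutes_between_completions(h, m, tasks):.1f} min")
```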
------------
Furthermore, I ran a few configurations on a single-processor host, socket 1150, Xeon E3-1245v3 (Haswell 4C/8T) @ 3.4 GHz, Linux. In each of these tests, I always ran the very same WU, with a measurement error below 0.2 %. The WU in this test had an FFT length of 768K (data size: 6.0 MB).
1 process with 8 threads:
12,460 s (3.46 h) run time, 89,390 s (24.8 h) CPU time, 33,800 PPD
1 process with 7 threads:
12,520 s (3.48 h) run time, 81,320 s (22.6 h) CPU time, 33,600 PPD
1 process with 6 threads:
12,790 s (3.55 h) run time, 72,560 s (20.2 h) CPU time, 32,900 PPD
2 processes with 4 threads each:
33,570 s (9.33 h) run time, 248,950 s (69.2 h) CPU time, 25,000 PPD
4 processes with 2 threads each:
about 22 h run time, about 21,000 PPD
(extrapolated after 15 % completion)
2 processes with 2 threads each:
about 9.4 h run time, about 24,800 PPD
(extrapolated after 50 % completion)
Conclusion:
On the 4C/8T Haswell with dual-channel memory, it is best to run only one task at a time, leave Hyperthreading enabled, and give the task as many processor threads as can be spared.
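The conclusion can be cross-checked from the fully measured run times alone, without knowing the credit per task: throughput is the number of simultaneous processes times 86,400 s divided by the run time. The following Python sketch does this for the four completed runs; the extrapolated runs are left out, and the names used are only for illustration.

```python
# (processes at a time, run time per task in seconds), from the measurements above
RUNS = {
    "1 x 8 threads": (1, 12460),
    "1 x 7 threads": (1, 12520),
    "1 x 6 threads": (1, 12790),
    "2 x 4 threads": (2, 33570),
}


def tasks_per_day(processes, run_time_s):
    """Completed tasks per day when `processes` tasks run side by side."""
    return processes * 86400 / run_time_s


baseline = tasks_per_day(*RUNS["1 x 8 threads"])

for label, (processes, run_time_s) in RUNS.items():
    rate = tasks_per_day(processes, run_time_s)
    print(f"{label}: {rate:.2f} tasks/day, {rate / baseline:.1%} of 1 x 8")
```

Two processes side by side come out at roughly three quarters of the single 8-thread process, which matches the ratio of the PPD figures (25,000 vs. 33,800).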
------------
Edit September 10, 2017:
fixed typo in tasks/socket, clarified use of Hyperthreading, increased accuracy of average task duration
Edit March 31, 2018:
added Xeon E3 results
Edits April 2-5, 2018:
Xeon E3 ran at 3.4 GHz, typo in s->h conversion
Edit October 20, 2019:
added FFT length of the Xeon E3 tests