Info PrimeGrid Challenges 2024, sieve-free edition


StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
This reduces the risk of damage from unforeseen scripting errors:
sudo -u boinc ./llr2_affinity.sh
or, more completely:
sudo -u boinc -g boinc ./llr2_affinity.sh
Then the script can't do anything that the pseudo-user boinc isn't allowed to do, such as FORMAT C:. :-)
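
For illustration only, a minimal sketch of the kind of guard a script could carry at its top; this assumes the pseudo-user is literally named "boinc", and the real llr2_affinity.sh may or may not do anything like it:

#!/bin/bash
# Refuse to run with any more privileges than the boinc pseudo-user has.
if [ "$(id -un)" != "boinc" ]; then
    echo "Please run as the boinc user, e.g.: sudo -u boinc -g boinc $0" >&2
    exit 1
fi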

- - - - - - - -

I couldn't find a validated PSP result from a 3840K workunit, so I took a 3456K workunit from the results table of Pavel Atnashev's computer cluster instead. (That's a 27.0 MB cache footprint of FFT coefficients.) It was the WU with the largest credit on this host when I looked about two hours ago. I ran this WU for 20 minutes per test and extrapolated the total duration from the progress made until then.
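
(In case the MB figures look arbitrary: as far as I understand, the cache footprint is simply the FFT length times 8 bytes per double-precision coefficient, for example:

3456K FFT: 3456 × 1024 × 8 bytes = 28,311,552 bytes = 27.0 MiB
3840K FFT: 3840 × 1024 × 8 bytes = 31,457,280 bytes = 30.0 MiB)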

workunit: 222113*2^34206293+1 for 82,165.65 credits
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821
hardware: EPYC 9554P (Zen 4 Genoa 64c/128t), cTDP = PPT = 400 W, 12 channels of DDR5-4800

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
8×8 | none (random scheduling by Linux) | 35:49:20 (128960 s) | 5.4 | 0.440 M | 3.60 GHz | 370 W | 1.19 kPPD/W
8×8 | 1 task : 1 CCX, only lower SMT threads | 12:52:37 (46357 s) | 14.9 | 1.225 M | 3.34 GHz | 485 W | 2.53 kPPD/W
8×16 | 1 task : 1 CCX, all SMT threads | 13:02:32 (46952 s) | 14.7 | 1.210 M | 3.05 GHz | 500 W | 2.42 kPPD/W
4×16 | 1 task : 2 CCXs, only lower SMT threads | 8:35:14 (30914 s) | 11.1 | 0.919 M | 3.60 GHz | 480 W | 1.91 kPPD/W
4×32 | 1 task : 2 CCXs, all SMT threads | 8:39:42 (31182 s) | 11.0 | 0.911 M | 3.18 GHz | 490 W | 1.86 kPPD/W
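
(For anyone cross-checking the derived columns: with 8 concurrent tasks, tasks/day ≈ 8 × 86400 s / duration, PPD ≈ tasks/day × credit per WU, and power efficiency = PPD / host power. E.g. for the second row:

8 × 86400 / 46357 s ≈ 14.9 tasks/day
14.9 × 82,165.65 credits ≈ 1.225 M PPD
1.225 M PPD / 485 W ≈ 2.53 kPPD/W)
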
I ran further tests but with a more recent, bigger workunit.

workunit: 225931*2^34726136+1 for 91,933.52 credits,
on Zen 4: all-complex AVX-512 FFT length 3600K (28.125 MBytes),
on Zen 2 and Broadwell-EP: zero-padded FMA3 FFT length 3840K (30.0 MBytes)
software: SuSE Linux, display-manager shut down, sllr2_1.3.0_linux64_220821, 15 minutes/test

hardware: dual EPYC 7452 (Zen 2 Rome, 2× 32c/64t, 2× 8×16 MB L3$), cTDP = PPT = 180 W, 2×8 channels of DDR4-3200

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
8×8 | none (random scheduling by Linux) | 53:36:35 (192995 s) | 3.6 | 0.329 M | 3.11 GHz | 460 W | 0.72 kPPD/W
8×8 | 1 task : 2 CCXs, only lower SMT threads | 32:23:52 (116632 s) | 5.93 | 0.545 M | 2.87 GHz | 480 W | 1.14 kPPD/W
8×16 | 1 task : 2 CCXs, all SMT threads | 32:59:08 (118748 s) | 5.82 | 0.535 M | 2.75 GHz | 495 W | 1.08 kPPD/W
4×16 | 1 task : 4 CCXs, only lower SMT threads | 17:02:29 (61349 s) | 5.63 | 0.518 M | 2.94 GHz | 460 W | 1.13 kPPD/W
4×32 | 1 task : 4 CCXs, all SMT threads | 17:12:43 (61963 s) | 5.58 | 0.513 M | 2.83 GHz | 475 W | 1.08 kPPD/W

A note on the "affinity = none" test: The outcome there is sensitive to the random nature of thread scheduling. If I re-ran this test, or ran it for much longer, I might get notably higher or lower results.

- - - - - - - -

hardware: dual Xeon E5-2696 v4 (Broadwell-EP, 2× 22c/44t, 2× 55 MB L3$), unlimited all-core turbo duration, 2×4 channels of DDR4-2133

test | affinity | avg. duration | avg. tasks/day | avg. PPD | avg. core clock | host power | power efficiency
4×11 | 2 tasks : 1 socket, only lower SMT threads | 28:04:01 (101041 s) | 3.42 | 0.314 M | 1.95 GHz | 475 W | 0.66 kPPD/W
2×22 | 1 task : 1 socket, only lower SMT threads | 12:58:18 (46698 s) | 3.70 | 0.340 M | 1.95 GHz | 440 W | 0.77 kPPD/W

There is considerable overhead here, from memory accesses especially in the 4×11 test, and from inter-thread synchronization especially in the 2×22 test. Because of this overhead, the host does not maintain its all-core turbo clock speed, which is 2.6 GHz in AVX2 workloads. In PrimeGrid subprojects with considerably smaller workunits, 2.6 GHz can be maintained, at well over 500 W host power consumption.

Still, LLR2's performance scaling to 22 threads per task is pretty good, much better than that of many other multithreaded programs, at least on this kind of host hardware with a big unified inclusive CPU cache.

- - - - - - - -

The workunit from above tests completed after 15.6 hours on the 9554P:

hardware: EPYC 9554P (Zen 4 Genoa 64c/128t, 8×32 MB L3$), cTDP = PPT = 400 W, 12 channels of DDR5-4800
running 1 task of the 225931*2^34726136+1 workunit along with 7 other random llrPSP tasks

setup | affinity | actual duration | tasks/day¹ | PPD¹ | avg. core clock | host power | power efficiency¹
8×8 | 1 task : 1 CCX, only lower SMT threads | 15:36:16 (56176 s) | 12.3 | 1.131 M | 3.57 GHz | 475 W | 2.38 kPPD/W
¹) if all tasks had the same performance

The first eight tasks which this host completed after the start of the challenge had 3600K (6×) and 3456K (2×) AVX-512 FFT lengths, respectively.
The six 3600K (28.125 MByte) units took ≈56,300 s (15.6 h)² and gave 1.13 MPPD (2.38 kPPD/W)³ on average.
The two 3456K (27.0 MByte) units took ≈43,600 s (12.1 h)² and gave 1.32 MPPD (≈2.75 kPPD/W)³ on average.
²) per task
³) per host, if it ran only this type of tasks, 8 at once

That is, my earlier short 20-minute test with a 3456K unit underestimated the actual performance for this type of workunit, and is not representative of the host's performance with the slightly bigger type of workunit. (PrimeGrid estimates the credit for each workunit a priori, based on the expected computational workload, with the goal of keeping PPD constant per subproject. But no estimation is perfect. The -14% drop in PPD from 3456K units to 3600K units is a bit surprising to me.) Still, between these two workunit types, the relative performance of the various threadcount/affinity combos which I tested should be very similar on the Zen computers with 16 or 32 MByte L3$ segments.

In contrast, for the tests on the Broadwell-EP with its 55 MB L3$, it made sense to me to wait for a 30.0 MByte FMA3 FFT workunit to show up.
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
PS,
a properly configured Ryzen 9 7950X should outperform this dual Xeon E5-2696 v4, or am I mistaken?

(That is, it can do two 3600K tasks at once in under 46,000 seconds.)
 

emoga

Member
May 13, 2018
196
308
136
PS,
a properly configured Ryzen 9 7950X should outperform this dual Xeon E5-2696 v4, or am I mistaken?

(That is, it can do two 3600K tasks at once in under 46,000 seconds.)

Using workunit 225931*2^34726136+1 for 91,933.52 credits on a 7950X (4.5 GHz / 1.0 V).
Power was measured at the wall.

setup | affinity | actual duration | tasks/day | PPD | avg. core clock | host power | power efficiency
2×8 | ascending | 11:08:55 (40135 s) | 4.305 | 395,773 | 4.5 GHz | 211 W | 1.88 kPPD/W
 
Reactions: StefanR5R

Orange Kid

Elite Member
Oct 9, 1999
4,388
2,169
146
Question using affinity
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 8-15,24-31
or would
0-15,16-31 be best?
If I am remembering correctly, 0-7 are the "real cores" and 16-23 are the "virtual cores" on the same CCX, and 8-15 and 24-31 are on the other, or am I completely wrong?
I'm trying to use all "cores" of one CCX per task.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
Day 2 stats:

Rank___Credits____Username
1______16010194___markfw
4______7711744____w a h
6______5461176____Icecold
11_____3672649____crashtech
16_____3010042____ChelseaOilman
17_____2789036____cellarnoise2
44_____867732_____mmonnin
59_____489164_____Orange Kid
90_____267917_____waffleironhead
112____176237_____StephieDolores
114____175969_____Ken_g6

Rank__Credits____Team
1_____40631866___TeAm AnandTech
2_____14236956___Czech National Team
3_____12349651___Antarctic Crunchers
4_____12236424___SETI.Germany
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
Day 2 stats:
14,236,956 + 12,349,651 + 12,236,424 = 38,823,031 < 40,631,866 :-O

Question using affinity
First of all, the throughput loss when cache-aligned CPU affinity isn't assigned is probably not as large on a dual-CCX Ryzen as on my 2P Rome (sixteen CCXs) or 1P Genoa (eight CCXs), because the EPYCs have so many more cache boundaries that can get in the way. On the other hand, the gap between core speed and RAM speed tends to be a bit bigger on Ryzen systems than on EPYCs.

Now to the numbering part of your questions:
From what I have heard, it differs between Windows and Linux. On Linux, it should be as you remember:

5950X:
0-7 = lower threads of CCX0, 8-15 = lower threads of CCX1,
16-23 = upper threads of CCX0, 24-31 = upper threads of CCX1​
5900X:
0-5 = lower threads of CCX0, 6-11 = lower threads of CCX1,
12-17 = upper threads of CCX0, 18-23 = upper threads of CCX1​

This can be verified with various tools. One which is available on probably all Linux installations is lscpu -e. In its output, threads which belong to the same physical core are attached to the same level-1 cache, and threads which belong to the same CCX are attached to the same level-3 cache.
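
If the lscpu -e output isn't clear enough, the same information can be read straight from sysfs. A quick sketch (assuming the usual x86 layout where cache index3 is the L3):

for c in /sys/devices/system/cpu/cpu[0-9]*; do
    printf '%s: core siblings %s, L3 siblings %s\n' "${c##*/}" \
        "$(cat "$c"/topology/thread_siblings_list)" \
        "$(cat "$c"/cache/index3/shared_cpu_list)"
done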

On 5950X, my guess is that it's best to run one task on CPUs 0-7 and the other on CPUs 8-15 (8 threads/task, leaving half of the host's hardware threads alone; background stuff can run there). Ditto, on 5900X one task on 0-5 and the other on 6-11 (6 threads/task). However, maybe somebody here with a Ryzen 59#0X has actually measured what's best...?
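
If you want to pin the two already-running LLR2 tasks by hand rather than via the script, something like the following should do it on a 5950X. (Hypothetical sketch; it assumes the task processes match "sllr2" and that it is run as the boinc user, since a user can only change the affinity of its own processes.)

pids=( $(pgrep -f sllr2) )        # expects exactly the two running task processes
taskset -a -pc 0-7  "${pids[0]}"  # first task -> lower SMT threads of CCX0
taskset -a -pc 8-15 "${pids[1]}"  # second task -> lower SMT threads of CCX1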
 

mmonnin03

Senior member
Nov 7, 2006
248
233
116
Question using affinity
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 8-15,24-31
or would
0-15,16-31 be best?
If I am remembering correctly, 0-7 are the "real cores" and 16-23 are the "virtual cores" on the same CCX, and 8-15 and 24-31 are on the other, or am I completely wrong?
I'm trying to use all "cores" of one CCX per task.
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
When you say "disables", maybe it just pins tasks to the even-numbered ones, so it simulates that.
 

Orange Kid

Elite Member
Oct 9, 1999
4,388
2,169
146
When I use Stef's affinity script with blocks of 8, no SMT, it assigns one task 0-7 and the other task 8-15. Using that logic, I thought that if I assigned all the threads of one CCX per task, it might be more effective (faster).
You know the old adage: if some is good, more is better.
Thanks for the answers.
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
Assuming Process Lasso is correct, when I tell it to disable SMT it disables all of the odd numbered cores.
On Windows, this is the numbering of logical CPUs which I have read elsewhere (but don't know how to verify):

5950X, while SMT is enabled in the BIOS:
0,2,4,6,8,10,12,14 = lower threads of CCX0, 16,18,20,22,24,26,28,30 = lower threads of CCX1,
1,3,5,7,9,11,13,15 = upper threads of CCX0, 17,19,21,23,25,27,29,31 = upper threads of CCX1​
5900X, while SMT is enabled in the BIOS:
0,2,4,6,8,10 = lower threads of CCX0, 12,14,16,18,20,22 = lower threads of CCX1,
1,3,5,7,9,11 = upper threads of CCX0, 13,15,17,19,21,23 = upper threads of CCX1​
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
If I configured 2 tasks at 16 threads each on a 5950X, could I use
0-7,16-23 8-15,24-31
or would
0-15,16-31 be best?
OK, I haven't really answered this part of your post directly. The first CPU list is the better one, as it indeed dedicates all threads of one CCX exclusively to one task, and all threads of the other CCX exclusively to the other task.

(However, 8 program threads per task, and 1 physical core : 1 program thread would likely be a little bit better... is my guess. Edit: That is, merely 8 program threads per task, but still only 2 simultaneous tasks on the host.)

PS: These CPU numbers are only valid on Linux. On Windows, see above.
 
Reactions: Orange Kid

Orange Kid

Elite Member
Oct 9, 1999
4,388
2,169
146
OK, I haven't really answered this part of your post directly. The first CPU list is the better one, as it indeed dedicates all threads of one CCX exclusively to one task, and all threads of the other CCX exclusively to the other task.

(However, 8 program threads per task, and 1 physical core : 1 program thread would likely be a little bit better... is my guess.)

PS: These CPU numbers are only valid on Linux. On Windows, see above.
I guess I should have mentioned that I use Linux.
I remembered seeing somewhere the numbering for threads but could not find it after many searches.
Using your affinity script with blocks of 8, no SMT, I get 0-7 and 8-15.
Using it with blocks of 16, I get 0-15 and 16-31.
This then confused me (easily done) and made me second-guess myself.
I shall let them run using full threads and see what happens. They are dedicated DC boxes.
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
The syntax "blocks of ..." is the script author's shorthand for consecutive numbering, "block" meaning just a "block of numbers". That is, it definitely does not mean "figure out how the hardware is organized into 'blocks' or 'complexes' or 'domains' or whatever".

There is indeed no shorthand for "each block = all hardware threads of a CCX". Not because it would be impossible or difficult to add a shorthand for this case, but because the author hasn't gotten around to adding one yet. You have to resort to writing the lists of CPU numbers explicitly.
 

crashtech

Lifer
Jan 4, 2013
10,596
2,161
146
(snip)
On 5950X, my guess is that it's best to run one task on CPUs 0-7 and the other on CPUs 8-15 (8 threads/task, leaving half of the host's hardware threads alone; background stuff can run there). Ditto, on 5900X one task on 0-5 and the other on 6-11 (6 threads/task). However, maybe somebody here with a Ryzen 59#0X has actually measured what's best...?
My testing of PSP on the 5950X shows a small regression in PPD when using SMT threads, even with Linux affinity set to 0-7,16-23 8-15,24-31. However, afaict, the 7950X shows SMT to be a small help, though it is probably a regression in PPD/W; I haven't tested that.
 
Reactions: Orange Kid

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
Day 3 stats:

Rank___Credits____Username
1______24611606___markfw
5______11966306___w a h
9______7518588____cellarnoise2
11_____6423280____Icecold
13_____6058677____ChelseaOilman
14_____5897854____crashtech
39_____1661832____mmonnin
66_____755220_____Orange Kid
75_____619605_____waffleironhead
111____360372_____Ken_g6
114____354779_____biodoc
115____352940_____StephieDolores
132____278443_____johnnevermind
219____12756______xxshanshon

Rank__Credits____Team
1_____66872263___TeAm AnandTech
2_____25030267___Czech National Team
3_____23672161___Antarctic Crunchers
4_____22665242___AMD Users
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
number of participants of the top ten teams, and median of their individual scores:

participants | median score | team
14 | 1,209k | TeAm AnandTech
32 | 407k | Czech National Team
12 | 470k | Antarctic Crunchers
7 | 260k | AMD Users
22 | 430k | SETI.Germany
1 | 16,392k | Romania
6 | 1,102k | Aggie The Pew
9 | 352k | Ukraine
4 | 1,925k | BOINC@MIXI
1 | 10,444k | Ural Federal University
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
Day 4 stats:

Rank___Credits____Username
1______35216675___markfw
6______15539186___w a h
10_____10204796___cellarnoise2
11_____8840990____crashtech
12_____8572735____ChelseaOilman
18_____7491777____Icecold
39_____2458049____mmonnin
65_____1183202____Orange Kid
90_____727140_____biodoc
96_____704104_____waffleironhead
125____453378_____Ken_g6
139____362950_____johnnevermind
145____352940_____StephieDolores
200____105768_____xxshanshon
282____1455_______SlangNRox

Rank__Credits____Team
1_____92215153___TeAm AnandTech
2_____37216282___Czech National Team
3_____35106174___Antarctic Crunchers
4_____34807515___AMD Users
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
Day 5 stats:

Rank___Credits____Username
1______46285444___markfw
6______19113066___w a h
10_____12948987___cellarnoise2
12_____11570548___ChelseaOilman
13_____11298990___crashtech
20_____8103498____Icecold
42_____2907579____mmonnin
67_____1546329____Orange Kid
88_____1080424____biodoc
98_____965188_____waffleironhead
115____727736_____johnnevermind
125____639927_____Ken_g6
165____352940_____StephieDolores
223____105768_____xxshanshon
320____1455_______SlangNRox

Rank__Credits____Team
1_____117647887___TeAm AnandTech
2_____51068086___Czech National Team
3_____47628145___Antarctic Crunchers
4_____46696108___AMD Users
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
Day 6 stats:

Rank___Credits____Username
2______56844754___markfw
6______23139273___w a h
11_____14722877___cellarnoise2
12_____14260987___crashtech
13_____14113089___ChelseaOilman
21_____9288531____Icecold
47_____3448382____mmonnin
67_____1908793____Orange Kid
81_____1446609____biodoc
89_____1320981____waffleironhead
110____997281_____johnnevermind
134____733648_____Ken_g6
169____437322_____StephieDolores
246____105768_____xxshanshon
352____1455_______SlangNRox

Rank__Credits____Team
1_____142769757___TeAm AnandTech
2_____66220790___Czech National Team
3_____62704884___Antarctic Crunchers
4_____62194555___AMD Users
 

cellarnoise

Senior member
Mar 22, 2017
758
411
136
Well, the TeAm is doing great!

Let's see what the final outcome looks like. A few members are moving down the stack a little bit!

But it is great to see others moving on up!
 
Reactions: crashtech

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
More-or-less final stats:

Rank___Credits____Username
2______67858580___markfw
4______40409619___crashtech
5______36445715___w a h
8______32139817___Icecold
14_____19224902___cellarnoise2
15_____16738054___ChelseaOilman
54_____3541952____mmonnin
78_____1908793____Orange Kid
82_____1806413____biodoc
101____1508370____waffleironhead
107____1370718____johnnevermind
131____921725_____Ken_g6
181____437322_____StephieDolores
269____105768_____xxshanshon
378____1455_______SlangNRox

Rank__Credits____Team
1_____224419210___TeAm AnandTech
2_____90911081___Czech National Team
3_____79739432___AMD Users
4_____74411166___Antarctic Crunchers
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
TeAm AnandTech ranking in the last five seasons:

2020: 9, 9, 8, 7, 6, 16, 10, 14, 18
2021: 5, 7, 9, 1, 1, 1, 1, 1, 1
2022: 1, 2¹, 1, 2², 1, 1, 1, 1, 1³
2023: 1, 1, 1, 1, 1, 1, 1, 1, 1
2024: 1, 1, 1, 1⁴, 1, 1, 1, 1, ?⁵

________
¹) with extensive guest-computing by TeAm members for team Ukraine who won this
²) Antarctic Crunchers won this, with Gelly of AC duking out the individuals' race with Skillz
³) GFN-21 like in the upcoming challenge, plus GFN-22 and DYFL: TAAT won by making 22.8% of all points, AC = 11.2%, CNT = 10.1%, SG = 9.8%.
⁴) GFN-19: This was the latest combined GPU+CPU challenge, with GPUs generally having an edge in performance and performance/Watt over CPUs in this one. It ended with TAAT = 15.6% of all points, AC = 12.1%, SG = 10.8%, CNT = 10.4%.
⁵) GFN-21: This is going to be mostly a GPU challenge, unless somebody has a large number of X3D$- or HBM-equipped CPUs at their disposal.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
So, I thought this next one was GPU only, yes? Even though it can do GPU + CPU? Is the CPU a waste, and is that why?
 

Skillz

Senior member
Feb 14, 2014
970
999
136
GPUs will do them much faster. CPUs can do them, but I believe the L3 cache requirement is around 40 MB, which very few CPUs have, leading to very long run times on CPUs.
 