[Setting up multiple client instances] is covered by various guides which different people have written on the subject. It's not hard for users with some BOINC experience, but it's also not something which can be rushed.
Note, "some BOINC experience" on Linux in particular implies "some experience with Linux file ownership and access permissions" too. Which is trivial, except if one has only worked with single-user OSs all the time before, or with a multi-user OS which was made by the vendor to look like a single-user OS to the unsuspecting public. (I am of course referring to Windows NT which is made to look very much like DOS/Windows.)
That setting for EPYCs seems like it might be really handy, does it inform the OS to treat each CCX as if it were a separate CPU socket?
The computer in post #156 has got "ACPI SRAT L3 Cache as NUMA Domain" switched on in the BIOS. Hence, the firmware reported 32 NUMA nodes, which coincide with the 32 last-level cache domains which this computer has.
The firmware also still reports 2 sockets to the OS. But the OS's process scheduler cares about NUMA nodes, not about sockets.
Properties of a NUMA node are: which logical CPUs belong to it, which physical memory range(s) are "near" to it (also which block devices, network devices, and PCI devices are near to it), and what relative distance its CPUs have to CPUs of the same and of other NUMA nodes.
numactl --hardware
shows some of these properties.
This relative distance is meant to reflect the latency penalty which an access to "far" memory takes. The firmware reports this distance by means of an artificial weighting factor, not as a real physical measure. But it would for example make sense for all NUMA nodes which are located on the same socket to be reported with a shorter distance to each other than pairs of NUMA nodes which reside on different sockets. (I haven't checked whether the Supermicro AMI BIOS which I am using reports differentiated NUMA distances, or simply the same for all, when "ACPI SRAT L3 Cache as NUMA Domain" is enabled, or when the "NUMA Nodes per Socket (NPS)" option is changed from the default.)
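For illustration, this is roughly the shape of that numactl --hardware output on a hypothetical 2-node machine (the numbers are made up, not taken from my 7452):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64215 MB
node 0 free: 60102 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64454 MB
node 1 free: 61873 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The "node distances" matrix at the bottom is exactly that artificial weighting factor; 10 is the conventional value for a node's distance to itself.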
About the NPS option: The I/O die of EPYC Rome and Milan is internally segmented into four quadrants. Each quadrant is directly attached to 0…2 compute chiplets (depending on the EPYC SKU) and has got one dual-channel DDR4 memory controller. Chiplets have the fastest access to the memory controller which sits on the same quadrant, and a tiny bit slower access to the memory controllers at the other three quadrants. This is the primary reason why AMD offer the NPS option, which defaults to 1 and can be changed to 2 or 4 on EPYC Rome (and, I suppose, on Milan). I am not aware of a report on the inner structure of EPYC Genoa's IOD, nor whether a corresponding NPS option exists in Genoa BIOSes.
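Either way, it is easy to verify from Linux what the firmware actually handed over after changing NPS or the L3-as-NUMA option; for example (generic commands, nothing EPYC-specific):

lscpu | grep -i 'numa node(s)'
ls -d /sys/devices/system/node/node* | wc -l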
────────
So that might be nice and all, but how to make best use of this on a given CPU and a given application, say PSP-LLR? This is something which I hesitate to comment on. The only EPYC which I have is the 7452, i.e. Zen 2, 32c/64t, 8 CCXs per socket, 155…180 W configurable power budget per socket. The further another EPYC model deviates from one or more of these properties, the less the performance characteristics which I measured on the 7452 carry over to it. I could somewhat emulate some of the smaller EPYC Rome SKUs by switching 1, 2, or 3 cores per CCX off in the BIOS, or/and by switching CCDs off, but that's it. 64c/128t or Zen 3 CPUs would be a rather different kettle of fish, let alone Zen 4.
Anyway: top performance is to be had via CPU "pinning", a.k.a. affinity. Reliance on these and other BIOS options alone can get you very close, or only somewhat close, to the optimum, depending on the particular use case. For example, I don't think there is an NPS=8 option for 64c/128t Rome CPUs.
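A minimal sketch of what such pinning can look like, assuming a separate client instance as above and NUMA node numbers as reported by numactl (the data directory is again just a placeholder):

# run one client instance (and all tasks it spawns) on NUMA node 0 only,
# and bind its memory allocations to that node
numactl --cpunodebind=0 --membind=0 boinc --dir /var/lib/boinc-node0 --daemon

# or change the affinity of an already running task to logical CPUs 0-7
taskset -pc 0-7 <PID>

Whether to bind memory strictly (--membind) or merely prefer the local node (--preferred) depends on how much RAM the tasks need per node.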