PrimeGrid Challenges 2021

StefanR5R · Jan 10, 2021

For maximum throughput, the FMA units need to be utilized as much as possible. To achieve this,

Avoid overusing the caches, as discussed. If the active data cannot be cached completely, the memory controller(s) and RAM likely cannot deliver the high throughput and low latency required to keep the FMA execution units fed.
Use a sufficiently high number of threads in total across all tasks, such that all FMA execution units are engaged.
Use as few threads per task as possible (while keeping the above in mind). More threads per task mean that a larger share of the time goes to inter-thread synchronization instead of actual computation.

Icecold said:
Has anybody tested the optimal thread count(in terms of PPD) for this FFT length yet?

For Zen2, @biodoc's results from 2019 with <2 MB FFT data size may still be representative of what works best with <4 MB FFT data size, since Zen2 has got 4 MB L3$ per core and 1 FMA3 unit per core (or: 1 AVX2 pipeline per core). @biodoc measured PPD and PPD/W back then:

StefanR5R said:
@biodoc's tests in October 2019 showed that the efficiency optimum on Zen2 depends on the CPB configuration: #83, #86. (These tests were done with a workunit with 240K FFT length, that is, less than 2 MB FFT data size.)

crashtech · Jan 10, 2021

I wonder about the chances of the data size exceeding ~~2MB~~ 4MB.

Edited so as not to mislead.

StefanR5R · Jan 10, 2021

Do you mean, exceeding 4 MB size (512 K length)?
In September 2020, somebody posted a chart of how the FMA3 FFT length with PPS-DIV develops depending on n:
https://www.primegrid.com/forum_thread.php?id=9339&nowrap=true#143703

(Work for all of the tabulated k's is given out all the time, and the n's which make up the first column in this chart are slowly creeping higher and higher. I.e., we are slowly advancing line by line of this chart.)

While I have no idea how fast the progress during the challenge will be, it seems to me as if the next length of 576 K is still a good way off. We are now square in the middle of the 480 K area.

Edit:
In my older tests, I found that the performance drop when processor caches are overused is not a sharp one, but merely gradual.
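
The sizes quoted in this thread (240K length is under 2 MB, 480K is 3.75 MB, 512K is 4 MB) are all consistent with 8 bytes per FFT element. Here is a minimal sketch under that assumption; the function name and the constant are made up for illustration and are not taken from LLR or PrimeGrid.

```python
# Assumption: each FFT element occupies 8 bytes (one double-precision value).
# This matches the sizes quoted in the thread but is not taken from LLR itself.
BYTES_PER_ELEMENT = 8


def fft_data_mb(fft_length_k: int) -> float:
    """Approximate FFT data size in MB for a length given in K (1K = 1024 elements)."""
    return fft_length_k * 1024 * BYTES_PER_ELEMENT / (1024 * 1024)


if __name__ == "__main__":
    # FFT lengths mentioned in the thread: 2019 tests, current, 4 MB mark, next step.
    for length_k in (240, 480, 512, 576):
        print(f"{length_k}K FFT -> {fft_data_mb(length_k):.3f} MB")
```

With these assumptions, the current 480K length stays below 4 MB per task, while the next length of 576K would come to 4.5 MB.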

StefanR5R · Jan 10, 2021

StefanR5R said:
In my older tests, I found that the performance drop when processor caches are overused is not a sharp one, but merely gradual.

Meanwhile I posted the updated recipe for offline tests and test results for i7-7700K @ 3.4 GHz with a PPS-DIV workunit which takes 480K long FMA3 FFTs. While best throughput can be had with 2 concurrent tasks on this processor as expected, running "too many" tasks results in 7…12 % loss of throughput with 4 concurrent tasks, and 18 % loss of throughput with 8 concurrent tasks.

I should test this processor once more at a higher core clock but same RAM speed. I suppose the benefit of keeping the entire workload cached is more pronounced then.

StefanR5R · Jan 12, 2021

As expected, the loss of throughput when the workload no longer fits into cache becomes more of an issue (a) when core clocks are higher at same RAM performance, (b) when there is a higher core count relative to RAM performance.

Furthermore, the comparison of @biodoc's tests on Ryzen (3700X, 3900X) and my tests on Epyc (7452) indicates that the configuration of task count × thread count for best throughput on Zen2 depends on the power budget. Evidently, SMT helps if the power budget per core is high, but SMT may hurt if this budget is low. Though maybe processor firmware plays a role too.

What's weird is how well 8-threaded tasks perform on EPYC (when SMT is used). This configuration got me almost optimum throughput, and best power efficiency, combined with naturally short run times. I have no explanation why this ran so much better than 4-threaded tasks. To be sure, I reran this and some of my other tests, and the test durations and the power meter readings were exactly the same in the re-runs.

Ken g6 · Jan 12, 2021

StefanR5R said:
What's weird is how well 8-threaded tasks perform on EPYC (when SMT is used). This configuration got me almost optimum throughput, and best power efficiency, combined with naturally short run times. I have no explanation why this ran so much better than 4-threaded tasks. To be sure, I reran this and some of my other tests, and the test durations and the power meter readings were exactly the same in the re-runs.

Maybe the threads are occasionally stepping on each other's cores? Maybe if the 4-threaded tasks were assigned to the proper cores, they'd work better? This script can assign PG tasks to cores. I'm not sure if they're "proper" cores, but it would be interesting to see how much "clump" or "spread" helps.

StefanR5R · Jan 12, 2021

Right, I can try this with taskset added to my script.

(Note to self: lscpu shows that logical CPUs 0-31,64-95 belong to socket 0, logical CPUs 32-63,96-127 belong to socket 1. lscpu -e indicates that logical CPUs 0-3,64-67 belong to the first CCX, 4-7,68-71 to the second, and so on.)
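
The numbering in the note above can be turned into CPU lists for taskset. This is only a sketch for the specific layout described there (4 cores per CCX, SMT siblings offset by 64); the helper names are hypothetical, and other machines will enumerate their logical CPUs differently, so check lscpu -e first.

```python
def ccx_logical_cpus(ccx_index: int, cores_per_ccx: int = 4, smt_offset: int = 64) -> list[int]:
    """Logical CPUs of one CCX, assuming the layout from the lscpu note above."""
    first = ccx_index * cores_per_ccx
    primary = list(range(first, first + cores_per_ccx))
    siblings = [cpu + smt_offset for cpu in primary]  # second hardware thread of each core
    return primary + siblings


def taskset_cpu_list(ccx_index: int, use_smt: bool = True,
                     cores_per_ccx: int = 4, smt_offset: int = 64) -> str:
    """CPU list string in the range syntax accepted by `taskset -c`."""
    first = ccx_index * cores_per_ccx
    last = first + cores_per_ccx - 1
    parts = [f"{first}-{last}"]
    if use_smt:
        parts.append(f"{first + smt_offset}-{last + smt_offset}")
    return ",".join(parts)


if __name__ == "__main__":
    for ccx in range(3):
        print(ccx, ccx_logical_cpus(ccx), taskset_cpu_list(ccx))
```

A task pinned to the first CCX with SMT would then be started along the lines of taskset -c 0-3,64-67 followed by the command to run.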

crashtech · Jan 12, 2021

I was having a difficult time imagining that numbering description, but then stumbled upon a diagram of a Threadripper 3970X, which helped make sense of it, even though it "only" has half the cores.

biodoc · Jan 13, 2021

The challenge has started. Join in!

crashtech · Jan 13, 2021

I'm all in.

Markfw · Jan 13, 2021

I joined with a 3900X @ 4 GHz, all cores (except one for the video card).

TennesseeTony · Jan 13, 2021

I am partly in. If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads. This means no threads for my GPUs, which is ok, the weather is supposed to be a bit warmer for a day or two.

I will give credit where it is due, PG sure knows how to push the envelope with regard to hardware utilization. I wonder how much attention other BOINC developers pay to PG? It seems they could learn a lot.

Markfw · Jan 13, 2021

TennesseeTony said:
I am partly in. If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads. This means no threads for my GPUs, which is ok, the weather is supposed to be a bit warmer for a day or two.

I will give credit where it is due, PG sure knows how to push the envelope with regard to hardware utilization. I wonder how much attention other BOINC developers pay to PG sub-projects? It seems they could learn a lot.

I have not read up on how to do the 8 hyper-threads. I have 23 threads running. I have surgery in the morning, so this will have to do. 90% at 13 minutes is the speed I am running at.

Skivelitis2 · Jan 13, 2021

I'm all in. Love crunching PrimeGrid especially during these challenges. I truly consider this to be the gold standard of BOINC projects. Top notch all around.

TennesseeTony · Jan 13, 2021

Best of luck on the surgery, Mark!

StefanR5R · Jan 14, 2021

TennesseeTony said:
If I translate StefanR5R correctly, Ryzen 3000 and above should be run at 8 hyper-threads.

@biodoc's results compared to mine show that my EPYC based tests don't reflect very well what's best on Ryzen. Also, my 8-threaded results were on a processor with 4C/CCX, therefore are not valid on processors with 3C/CCX.

For reference:

Ryzen 3900X tests (12C/24T, 105 W; 3C/CCX, 9 W/core)
Ryzen 3700X tests (8C/16T, 65…45 W; 4C/CCX, 8…6 W/core)
Kaby Lake and EPYC tests (4C/8T, and 2x32C/64T respectively, the EPYC has 5 W/core PPT)

Markfw said:
I joined with a 3900X @ 4 GHz, all cores (except one for the video card).

Markfw said:
I have 23 threads running. I have surgery in the morning, so this will have to do.

Performance will go up somewhat if you reduce this to 12 tasks at once. Yep, less will do more. :-) It will be beneficial to PrimeGrid and to the GPU (Folding, certainly).

Or even reduce to 11 tasks at once, which will put a small dent into PrimeGrid but will help keep up Folding performance even more.

The reason why 12 or 11 tasks at once will result in more tasks finished per day than 23 tasks at once is that the latter spend more time fighting for RAM access. With fewer tasks, the work stays mostly in the CPU cache, the available RAM bandwidth is used more effectively, and the AVX2 units can be fully utilized.

A quick way to switch over to 12 tasks is to set "Use at most 50 % of the CPUs" in computing preferences. Or "…49 %…" for 11 tasks at once.
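
The arithmetic behind those two percentages, as a small sketch. It assumes that the client truncates the product of logical CPU count and percentage to a whole number, which is consistent with 49 % giving 11 tasks on a 24-thread CPU; the function name is made up and this is not BOINC's actual code.

```python
def max_concurrent_tasks(logical_cpus: int, percent: int) -> int:
    """Single-threaded tasks allowed by a 'Use at most N % of the CPUs' setting.

    Assumption: the result is truncated to a whole number, with a floor of 1.
    """
    return max(1, (logical_cpus * percent) // 100)


if __name__ == "__main__":
    # Ryzen 3900X: 12 cores, 24 logical CPUs
    for pct in (100, 50, 49):
        print(f"{pct} % of 24 logical CPUs -> {max_concurrent_tasks(24, pct)} tasks")
```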

It is possible to configure more than 1 thread per task, either through the PrimeGrid web preferences, or with an app_config.xml. But you will probably get the most tasks/day done if you stick with single-threaded tasks but no more than 1 task per real core (i.e. 12 tasks at once on Ryzen 3900X).

I wish you well for the surgery. Have a good recovery!

StefanR5R · Jan 14, 2021

BTW, I currently have just one 2xE5v4 in the race. When the first task finished, and the client attempted to request more work while starting the first two file uploads, the PrimeGrid server's scheduler timed out, and the uploads timed out too. That's not unexpected. This server already struggled in an earlier race even before the introduction of LLR2, in a race with another subproject with comparably small tasks = many tasks in progress = high database load. (In that race, the server was unable to keep up the 15-minute period of race stats updates.) Now, with LLR2, they have not only high database load during a race, but also several orders of magnitude more network I/O and mass storage I/O.

TennesseeTony · Jan 14, 2021

Apparently, one needs to check the "use CPU" checkbox in the PG preferences, or one will awake to a rather cold home.

biodoc · Jan 14, 2021

I decided on the following settings which are in the "SMT on but not used" category. Each task is expected to use 3.75 MB of cache.

3700X (32 MB L3 cache):
8 simultaneous tasks x 1 core each.
8 x 3.75 MB = 30 MB L3 cache used.

3900X (64 MB L3 cache):
12 simultaneous tasks x 1 core each.
12 x 3.75 MB = 45 MB L3 cache used.

3950X (64 MB L3 cache):
16 simultaneous tasks x 1 core each.
16 x 3.75 MB = 60 MB L3 cache used.

The tasks on the 3950X are running significantly slower than the other 2 computers (clock speeds are similar). I wonder why?

EDIT: the top computer is the 3950X, the 2nd is the 3900X and the bottom is the 3700X.
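
The cache budget above can be recomputed per CCX as well as per CPU, since on Zen2 the L3 cache is split into 16 MB slices that are each private to one CCX. A rough sketch: the 3.75 MB per task and the L3 totals are the figures from the post, while the CCX counts (2, 4, 4) and the class and function names are additions for illustration.

```python
from dataclasses import dataclass

MB_PER_TASK = 3.75  # expected cache use per task, as stated in the post above


@dataclass
class Cpu:
    name: str
    cores: int
    ccx_count: int      # not from the post: 3700X has 2 CCXs, 3900X and 3950X have 4
    l3_total_mb: int


CPUS = [
    Cpu("3700X", cores=8, ccx_count=2, l3_total_mb=32),
    Cpu("3900X", cores=12, ccx_count=4, l3_total_mb=64),
    Cpu("3950X", cores=16, ccx_count=4, l3_total_mb=64),
]


def total_footprint_mb(cpu: Cpu) -> float:
    """Cache use with one single-threaded task per core."""
    return cpu.cores * MB_PER_TASK


def per_ccx_footprint_mb(cpu: Cpu) -> float:
    """Cache use inside one CCX, assuming tasks are spread evenly over the CCXs."""
    return cpu.cores / cpu.ccx_count * MB_PER_TASK


def l3_per_ccx_mb(cpu: Cpu) -> float:
    return cpu.l3_total_mb / cpu.ccx_count


if __name__ == "__main__":
    for cpu in CPUS:
        print(f"{cpu.name}: {total_footprint_mb(cpu)} of {cpu.l3_total_mb} MB total, "
              f"{per_ccx_footprint_mb(cpu)} of {l3_per_ccx_mb(cpu)} MB per CCX")
```

Under these assumptions the 3700X and 3950X both sit at 15 MB of 16 MB per CCX, while the 3900X with 3 cores per CCX has more headroom at 11.25 MB of 16 MB. Whether that explains the slower 3950X tasks is not something this calculation can show.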

StefanR5R · Jan 14, 2021

biodoc said:
The tasks on the 3950X are running significantly slower than the other 2 computers (clock speeds are similar). I wonder why?

What about credit/run time, and extrapolated PPD/computer?

(You can look up your pending credit of results which weren't validated yet at Your account -> Pending credit -> View. This table can also be sorted by host ID.)
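
For reference, extrapolating PPD from credit and run time could look like the sketch below. The numbers in the example are purely hypothetical placeholders, not results from this challenge, and the function name is made up. For multi-threaded tasks, the run time should be the elapsed (wall-clock) time, not the CPU time.

```python
SECONDS_PER_DAY = 86_400


def extrapolated_ppd(credit_per_task: float, run_time_s: float, concurrent_tasks: int) -> float:
    """Points per day for a host, assuming it runs such tasks continuously.

    run_time_s is the elapsed time of one task while `concurrent_tasks`
    tasks are running side by side.
    """
    return credit_per_task * SECONDS_PER_DAY * concurrent_tasks / run_time_s


if __name__ == "__main__":
    # Hypothetical example: 100 credits per task, 1 hour per task, 12 tasks at once.
    print(extrapolated_ppd(credit_per_task=100.0, run_time_s=3600.0, concurrent_tasks=12))
```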

biodoc · Jan 14, 2021

StefanR5R said:
What about credit/run time, and extrapolated PPD/computer?

EDIT: I switched the 3950X to 8 tasks x 4 threads per task.

StefanR5R · Jan 14, 2021

If the 3950X and 3900X were operating at the same power limit, then 130 kPPD : 118 kPPD looks alright to me.

- - - - - - - -

StefanR5R said:
I currently have just one 2xE5v4 in the race.

It's doing 230 kPPD after validation. But from what I believe I saw today in the morning, it took 550 W or something like that.

geecee · Jan 14, 2021

Hmm, getting DLL initialization errors on one of my boxes (example from event log: PrimeGrid | Task llrDIV_358048398c_0 exited with a DLL initialization error.) for tasks for this challenge. Strange, tried rebooting, setting my CPU usage to 100%, and removing and re-adding the project, but still getting the errors. I see some people posting about this on BOINC topics in general, but mostly old threads and old BOINC versions. Anyone ever see this before? Thanks.

EDIT: Running the latest version of BOINC.

Ken g6 · Jan 14, 2021

I don't use Windows anymore, but you probably need to install an MSVC runtime of some kind.

https://stackoverflow.com/questions/61352248/any-ideas-on-how-to-fix-a-dll-initialization-error-in-visual-studio-2019

biodoc · Jan 14, 2021

StefanR5R said:
If the 3950X and 3900X were operating at the same power limit, then 130 kPPD : 118 kPPD looks alright to me.

Both are at 105 W PPT. The 3950X seems to be doing better at 8 tasks/4 threads each but I don't have enough data yet to be confident in that conclusion. The 3950X has been running on unbuffered ECC RAM for the last few days but I don't think RAM speed/timings would have an impact in this situation.
