Info PrimeGrid Challenges 2024, sieve-free edition


StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
With current workunits I see either 2x384K (7,9XX credits) or 2x400K FFT length (10,5XX…10,8XX credits). 2x400K FFT length translates to 6.25 MBytes data size.

Quite a jump in workunit size, and in credits per result. It goes to show that testing with random workunits requires comparing by credit/time, not just by completed tasks/time.
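The 6.25 MBytes figure above can be reproduced with a short calculation. This is only a minimal sketch: it assumes one 8-byte double per FFT element, so that a length written as 2xNK occupies 2 * N * 1024 * 8 bytes. The function name is made up for illustration.

```python
def fft_data_size_mbytes(half_length_k: int, bytes_per_element: int = 8) -> float:
    """Data size in MBytes (2**20 bytes) for an FFT length written as 2x<half_length_k>K."""
    elements = 2 * half_length_k * 1024      # "2x400K" -> 2 * 400 * 1024 elements
    return elements * bytes_per_element / 2**20

for k in (384, 400):
    print(f"2x{k}K -> {fft_data_size_mbytes(k):.2f} MBytes")
```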
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
Day 0.5 stats:

Rank___Credits____Username
4______1223801____crashtech
9______825115_____markfw
14_____544882_____w a h
28_____252589___10esseeTony
31_____203925_____Orange Kid
33_____185693_____cellarnoise2
52_____113199_____Fardringle
56_____109819_____mmonnin
79_____73096______Letin Noxe
85_____59545______Ken_g6
100____44710______biodoc
126____23611______mike656
128____22922______Pokey
203____1021_______waffleironhead

Rank__Credits____Team
1_____3683934____TeAm AnandTech
2_____3465276____Antarctic Crunchers
3_____2594404____Czech National Team
4_____2148771____SETI.Germany

An uneasy first place so far.
 

Orange Kid

Elite Member
Oct 9, 1999
4,375
2,164
146
Well, I had a perfect start for a 08:00 start time. My plan worked perfectly and I lost no sleep.
Too bad I didn’t see that it was a 08:08 start 🫣
 
Reactions: Skillz

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
Code:
Summary for AMD EPYC 7452 32-Core Processor, test cutoff: 25 minutes
  candidate  |   credit   | tasks x threads, affinity |     task duration     | tasks/day | points/day
-------------+------------+---------------------------+-----------------------+-----------+-----------
  4651711#-1 |   7,306.04 | 32x2, ascending           |   7:26:43 =   26803 s |       103 |    753,632
  4651711#-1 |   7,306.04 | 32x4, ascending           |   7:25:35 =   26735 s |       103 |    755,546
  4651711#-1 |   7,306.04 | 16x4, ascending           |   3:22:09 =   12129 s |       113 |    832,698
  4651711#-1 |   7,306.04 | 16x8, ascending           |   6:24:52 =   23092 s |      59.8 |    437,368
  4651711#-1 |   7,306.04 | 16x8, 0-3,64-67 4-7,68-7~ |   3:22:03 =   12123 s |       114 |    833,115
This is a dual-7452 computer, that is, 128 threads total. PPT limit is set to 180W/socket.

Power efficiency:
32x2: 750 kPPD / 480 W = 1.6 kPPD/W
32x4: 760 kPPD / 505 W = 1.5 kPPD/W
16x4: 830 kPPD / 430 W = 1.9 kPPD/W
16x8 cross-CCX: 440 kPPD / 480 W = 0.9 kPPD/W
16x8 in-CCX: 830 kPPD / 450 W = 1.8 kPPD/W
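For anyone who wants to redo these numbers with their own measurements, here is a rough sketch of the arithmetic behind the table and the efficiency list. It assumes that points/day is simply concurrent tasks x (86400 s / task duration) x credit, and that kPPD is rounded to the nearest 10 kPPD before dividing by power, as the list above appears to do. The function names are illustrative only, and the results can differ from the table by a few points because the table's durations are rounded to whole seconds.

```python
SECONDS_PER_DAY = 86_400
CREDIT = 7306.04  # credit of the test candidate 4651711#-1

def points_per_day(credit, concurrent_tasks, task_seconds):
    """Throughput in points/day when `concurrent_tasks` tasks run side by side."""
    tasks_per_day = concurrent_tasks * SECONDS_PER_DAY / task_seconds
    return tasks_per_day * credit

def kppd_per_watt(ppd, watts):
    """Efficiency in kPPD/W; PPD is first rounded to the nearest 10 kPPD."""
    return round(ppd, -4) / 1000 / watts

# (label, concurrent tasks, task duration in s, power in W), taken from the post
runs = [
    ("32x2", 32, 26803, 480),
    ("32x4", 32, 26735, 505),
    ("16x4", 16, 12129, 430),
    ("16x8 cross-CCX", 16, 23092, 480),
    ("16x8 in-CCX", 16, 12123, 450),
]

for label, tasks, seconds, watts in runs:
    ppd = points_per_day(CREDIT, tasks, seconds)
    print(f"{label}: {ppd:,.0f} PPD, {kppd_per_watt(ppd, watts):.1f} kPPD/W")
```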

The "16x8 ascending" config was a brain fart of mine, but still an interesting showcase for how dramatically the performance tanks if tasks need to communicate across CCXs. I realized this only after the four tests had completed, and therefore added the 5th, in which all tasks are reined in to CCX boundaries again.
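The in-CCX affinity lists of the 5th test follow a regular pattern that can be generated rather than typed. The sketch below assumes 4 cores per Zen 2 CCX, 64 cores in total, and a CPU numbering in which the SMT sibling of core n is logical CPU n + 64, which is what the "0-3,64-67" entry in the table suggests. The helper name is hypothetical.

```python
def ccx_cpu_lists(total_cores, cores_per_ccx, smt_offset):
    """Return one CPU list string per CCX, e.g. '0-3,64-67'."""
    lists = []
    for first in range(0, total_cores, cores_per_ccx):
        last = first + cores_per_ccx - 1
        # physical cores of this CCX, followed by their SMT siblings
        lists.append(f"{first}-{last},{first + smt_offset}-{last + smt_offset}")
    return lists

cpu_lists = ccx_cpu_lists(total_cores=64, cores_per_ccx=4, smt_offset=64)
print(len(cpu_lists), "CCXs;", "first two lists:", cpu_lists[:2])
```

Strings in this form can be handed to a tool that accepts CPU lists, for instance `taskset -c` on Linux. The numbering should be checked on the actual machine first, since it is an assumption here.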

1.9 kPPD/W from this 7nm Zen 2 dual-socket computer, compared to 2.0 kPPD/W from the 5nm Zen 4 9554P single-socket computer, is surprisingly decent. Caveat: I performed only a single run per testcase, so I am not sure how precise these results are. Plus, this testing relies on PRST's own progress percentage logging, and I am not sure about its precision yet.

Edit: Also, I am a bit surprised to see 16x4 providing higher throughput than 32x2. I expected it to be the other way around.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
Day 1.5 stats:

Rank___Credits____Username
2______7680560____w a h
3______7490359____[TA]Skillz
8______4934972____crashtech
10_____3980235____markfw
18_____2293848____cellarnoise2
23_____1686230___10esseeTony
33_____1055854____Orange Kid
42_____774667_____biodoc
47_____674335_____Fardringle
61_____556635_____mmonnin
63_____493431_____waffleironhead
109____250835_____Letin Noxe
113____236063_____Pokey
130____202042_____Ken_g6
134____184173_____mike656
164____106889_____DROFFUNGUS
173____92680______kiska
225____50173______Icecold

Rank__Credits____Team
1_____32743990___TeAm AnandTech
2_____29231091___Antarctic Crunchers
3_____12134263___Czech National Team
4_____10318630___SETI.Germany
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
I tested older kit now too:
Code:
Summary for Intel(R) Xeon(R) CPU E5-2696 v4, test cutoff: 15 minutes
  candidate  |   credit   | tasks x threads, affinity |     task duration     | tasks/day | points/day
-------------+------------+---------------------------+-----------------------+-----------+-----------
  4651711#-1 |   7,306.04 | 4x11, none                |   2:25:37 =    8737 s |      39.5 |    288,990
  4651711#-1 |   7,306.04 | 4x22, none                |   2:30:10 =    9010 s |      38.3 |    280,237
  4651711#-1 |   7,306.04 | 6x8, none                 |   3:32:49 =   12769 s |      40.5 |    296,610
  4651711#-1 |   7,306.04 | 6x14, none                |   3:18:45 =   11925 s |      43.4 |    317,600
  4651711#-1 |   7,306.04 | 8x6, none                 |   4:28:09 =   16089 s |      42.9 |    313,874
  4651711#-1 |   7,306.04 | 8x11, none                |   4:12:42 =   15162 s |      45.5 |    333,060
  4651711#-1 |   7,306.04 | 10x5, none                |   5:42:23 =   20543 s |      42.0 |    307,277
  4651711#-1 |   7,306.04 | 10x8, none                |   5:26:28 =   19588 s |      44.1 |    322,254
  4651711#-1 |   7,306.04 | 12x4, none                |   6:45:51 =   24351 s |      42.5 |    311,069
  4651711#-1 |   7,306.04 | 12x7, none                |   6:53:44 =   24824 s |      41.7 |    305,144
This is a dual-socket computer again (2x 22c/44t). The CPUs are allowed to run at turbo clock all the time = all-core AVX2 turbo = 2.6 GHz.
Best config with all HT threads used:
8x11: 333 kPPD / 505 W = 0.66 kPPD/W
Best config with HT almost unused:
8x6: 314 kPPD / 455 W = 0.69 kPPD/W

(Edit: Of course the now current PRST workunits are considerably larger and slower than my test workunit, but get more credit/result.)
 
Reactions: TennesseeTony

crashtech

Lifer
Jan 4, 2013
10,573
2,145
146
...(Edit: Of course the now current PRST workunits are considerably larger and slower than my test workunit, but get more credit/result.)
If they are larger, would not the L3 requirement have grown also, potentially affecting which config is best?
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
would not the L3 requirement have grown also, potentially affecting which config is best?
This is possible.

As noted, the data size of my test WU is 6.0 MB. In the case of LLR2 and Genefer, we know that each last-level cache domain should run at most as many program instances at once as keep the sum of their data sizes within the last-level cache size. My test on the Haswell with 8 MB inclusive L3$ indicates that the same is true with PRST.

But my test series on Broadwell-EP with 55 MB L3$/socket (inclusive cache) gives me pause:
4 tasks at once (x11 threads): 24 MB data size in total, 290 kPPD
6 tasks at once (x14 threads): 36 MB data size in total, 320 kPPD
8 tasks at once (x11 threads): 48 MB data size in total, 330 kPPD
10 tasks at once (x8 threads): 60 MB data size in total, 320 kPPD
12 tasks at once (x4 threads): 72 MB data size in total, 310 kPPD

Does this indicate that each program instance needs notably more cache than the datasize of the FFT coefficients? Maybe, maybe not. I ran this without CPU affinities, because
1.) as you know Broadwell-EP's L3$ is shared by all cores,
2.) the Linux kernel generally does a good job of keeping all threads of a program instance on the same socket.
Well, maybe there was still enough back and forth between sockets going on that the 10x and 12x tests suffered somewhat.
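The data-size totals in the list above, and how they compare with the L3 capacity, can be tabulated with a few lines. This sketch assumes 6.0 MB per task and an even split of tasks across the two sockets, which is not guaranteed without CPU affinities. The names are illustrative.

```python
DATA_MB_PER_TASK = 6.0    # data size of the test workunit
L3_MB_PER_SOCKET = 55     # Broadwell-EP L3 per socket
SOCKETS = 2

def footprint(tasks):
    """Return (total MB, MB per socket, fits in L3?) assuming an even split."""
    total = tasks * DATA_MB_PER_TASK
    per_socket = total / SOCKETS
    return total, per_socket, per_socket <= L3_MB_PER_SOCKET

for tasks in (4, 6, 8, 10, 12):
    total, per_socket, fits = footprint(tasks)
    print(f"{tasks:2d} tasks: {total:4.0f} MB total, {per_socket:4.0f} MB/socket, fits: {fits}")
```

By this count every tested configuration stays below 55 MB per socket, which is why the lower throughput at 10 and 12 tasks is not explained by the plain data size alone.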

I would be wiser if I took the time to re-test with CPU affinities (forcing half of the tasks onto one socket and the other half on the other), and/or to test with one of the current 2x480K (7.5 MB) workunits.

When I set up the test series, my foremost concern was how to divide this awkward CPU count of 2x22c/44t up into task counts x thread counts. After arriving at the above combos with up to 12 tasks, my thought was that 2*55 MB cache will likely be good for 12*8 MB FFT data, so that it shouldn't matter much whether I kept the old workunit for consistency or tested with a different bigger workunit...
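One way to reproduce the tasks x threads combos from the table is shown below. The rounding rule is inferred from the table rather than stated anywhere: threads per task rounded up from 44 physical cores for the "HT almost unused" case, and rounded down from 88 hardware threads for the "all HT threads" case. Treat it as a sketch, not as the method that was actually used.

```python
import math

PHYSICAL_CORES = 44     # 2 x 22 cores
HARDWARE_THREADS = 88   # with Hyper-Threading

def thread_options(tasks):
    """Return (threads/task with HT mostly idle, threads/task using HT)."""
    few = math.ceil(PHYSICAL_CORES / tasks)   # may slightly exceed 44 threads in total
    many = HARDWARE_THREADS // tasks          # never exceeds 88 threads in total
    return few, many

for tasks in (4, 6, 8, 10, 12):
    few, many = thread_options(tasks)
    print(f"{tasks}x{few} ({tasks * few} threads), {tasks}x{many} ({tasks * many} threads)")
```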
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
Day 2.5 stats, a little past halfway.

Rank___Credits____Username
2______20200071___[TA]Skillz
5______13149584___w a h
7______9009100____crashtech
12_____7035271____markfw
18_____4162377____cellarnoise2
25_____3388452___10esseeTony
33_____1968968____Orange Kid
35_____1883977____biodoc
45_____1338827____Icecold
46_____1296299____Fardringle
60_____1033331____mmonnin
67_____931777_____waffleironhead
106____512305_____Letin Noxe
119____435215_____Pokey
131____376294_____mike656
139____345471_____Ken_g6
148____298736_____DROFFUNGUS
188____184662_____kiska
301____36364______ChelseaOilman

Rank__Credits____Team
1_____67587090___TeAm AnandTech
2_____55398173___Antarctic Crunchers
3_____24816808___Czech National Team
4_____20494146___SETI.Germany

Welcome to mike656!
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
Day 3.5 stats:

Rank___Credits____Username
2______30115596___[TA]Skillz
5______17777567___w a h
7______12680790___crashtech
11_____9411629____markfw
18_____5742224____cellarnoise2
22_____4948027___10esseeTony
32_____2838533____biodoc
35_____2701834____Orange Kid
38_____2392315____Icecold
47_____1823535____Fardringle
58_____1479517____mmonnin
67_____1244856____waffleironhead
104____750699_____Letin Noxe
108____739200_____Pokey
129____550896_____mike656
136____490931_____DROFFUNGUS
140____478837_____Ken_g6
202____233326_____kiska
339____36364______ChelseaOilman

Rank__Credits____Team
1_____96436686___TeAm AnandTech
2_____77404106___Antarctic Crunchers
3_____34366587___Czech National Team
4_____27943683___SETI.Germany
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
Current head counts of the top ten teams:
1 ...... 19 ...... TeAm AnandTech
2 ...... 24 ...... Antarctic Crunchers
3 ...... 32 ...... Czech National Team
4 ...... 27 ...... SETI.Germany
5 ........ 9 ...... Aggie The Pew
6 ........ 6 ...... AMD Users
7 ........ 9 ...... Ukraine
8 ........ 6 ...... The Knights Who Say Ni!
9 ........ 2 ...... [H]ard|OCP
10 ...... 8 ...... Planet 3DNow!
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,049
15,191
136
We are now over 20 million ahead of second place !!!!
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
From the challenge thread at PrimeGrid's message board:
pyajve said:
Will there be FFT (CPU cache) requirements on the project preferences page like for all other sub-projects?
Michael Goetz said:
Sorry, but no. It isn't possible at this time.
chr80 said:
From what I found, this task needs about 15 MB of L3 cache:
To estimate how much L3 cache memory this task uses on the processor, we need to consider the following information from the stderr message:

FFT length: "Using Montgomery reduction AVX FFT length 2x480K"

This means that the length of the Fourier Transform (FFT) is 2x480K, or 960 thousand points.

Calculating memory requirements
The 960K FFT transform operates on complex numbers (each FFT point is usually two numbers: real and imaginary). Depending on the implementation, each complex number can take up from 8 to 16 bytes (in the case of double precision). Assuming the most common scenario, i.e. 16 bytes per complex number:

Requirement for one FFT point: 16 bytes
Number of FFT points: 960K (i.e. 960 * 1024 = 983040 points)
Total memory requirement:
983040×16 B=15,728,640 B (approximately 15 MB)

Is this estimate/calculation correct?
Yves Gallot said:
There are two variables (read-write) and two constants (read-only).
It is a real FFT, so the size is 8 * 480 kB for each number.
4 * 8 * 480 = 15 MB.
But depending on memory bandwidth, 2 * 8 * 480 = 7.5 MB or 3 * 8 * 480 = 11.25 MB may be sufficient because of the two constants.
IOW, it is not as straightforward as with Genefer and LLR2.
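The arithmetic in the reply above can be written out for any FFT length. This is a minimal sketch that takes the stated 8 * 480 kB per number at face value and counts 1 MB as 1024 kB; the function name and parameters are illustrative.

```python
def working_set_mb(half_length_k, numbers):
    """Cache footprint in MB for `numbers` FFT-sized arrays of a 2x<half_length_k>K transform."""
    per_number_kb = 8 * half_length_k    # "8 * 480 kB for each number"
    return numbers * per_number_kb / 1024

for numbers, meaning in ((2, "variables only"), (3, "variables + one constant"), (4, "everything")):
    print(f"{numbers} numbers ({meaning}): {working_set_mb(480, numbers)} MB")
```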

It's a pity that the workunits which were available before the challenge were so much smaller than the ones that appeared quite soon after the start of the challenge. Plus, the test durations required for reliable data are longer than with LLR2, not to mention Genefer. But if somebody has got several computers with the same CPU and memory, diverting one computer for additional tests during a challenge isn't such a bad idea.

Edit:
The Zen 4 CCXes in Ryzen 7000 and EPYC 9004 have almost twice as much read bandwidth to the IOD as write bandwidth, if I understood correctly. If so, the extra read bandwidth should help with access to constant FFT data.

10esseeTony said:
ignore the 7950X, it does not perform properly when multi-threaded and is back to single thread tasks (sieve).
As this one runs under Windows, I suggest using pschoefer's straightforward PowerShell script to pin tasks to CCXs by means of CPU affinities. This, or any of the several other methods to assign CPU affinity, should give any dual-CCX Ryzen a considerable boost at PrimeGrid.
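For illustration, here is how per-CCX affinity bitmasks for such a CPU could be derived. This is not pschoefer's script, only a sketch under two assumptions: the 7950X exposes 2 CCXs with 16 logical processors each, and the logical processors of one CCX are numbered contiguously (0-15 and 16-31). The function name is hypothetical.

```python
def ccx_affinity_masks(ccx_count, logical_cpus_per_ccx):
    """Return one affinity bitmask per CCX, with one bit per logical processor."""
    masks = []
    for ccx in range(ccx_count):
        block = (1 << logical_cpus_per_ccx) - 1           # e.g. 16 set bits
        masks.append(block << (ccx * logical_cpus_per_ccx))
    return masks

masks = ccx_affinity_masks(ccx_count=2, logical_cpus_per_ccx=16)
for ccx, mask in enumerate(masks):
    print(f"CCX {ccx}: 0x{mask:08X}")
```

A bitmask of this kind is the form in which Windows expresses process affinity, so each running task would get the mask of the CCX it is supposed to stay on.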

The GCW-LLR challenge in September will have 25 MB cache requirement per task. If I don't forget, I shall repost pschoefer's script ready to use for GCW-LLR before the challenge.

edit 2: typo
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
Day 4.5 stats:

Rank___Credits____Username
2______42134542___[TA]Skillz
4______24060098___w a h
7______16974988___crashtech
11_____12195446___markfw
18_____7457110____cellarnoise2
22_____6591093___10esseeTony
31_____3937119____biodoc
34_____3549909____Icecold
36_____3417316____Orange Kid
46_____2407711____Fardringle
56_____1953452____mmonnin
68_____1687737____waffleironhead
94_____1032588____Pokey
106____972995_____Letin Noxe
130____725221_____mike656
137____642042_____DROFFUNGUS
145____591555_____Ken_g6
197____338975_____kiska
363____36364______ChelseaOilman

Rank__Credits____Team
1_____130706268___TeAm AnandTech
2_____110025437___Antarctic Crunchers
3_____46173159___Czech National Team
4_____38199710___SETI.Germany

Almost done, but I'll probably be asleep when it ends.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,049
15,191
136
10 million point lead. I will be turning mine off at 10-11 pm. 2 hours won't mean much from me.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,330
4,005
75
More-or-less final stats:

Rank___Credits____Username
2______47442486___[TA]Skillz
4______25876875___w a h
7______18755162___crashtech
11_____13262223___markfw
19_____7795964____cellarnoise2
23_____7169267___10esseeTony
31_____4378535____biodoc
34_____4017416____Icecold
36_____3714876____Orange Kid
49_____2654900____Fardringle
58_____2128434____mmonnin
70_____1812733____waffleironhead
93_____1146259____Pokey
110____998341_____Letin Noxe
129____800366_____mike656
142____680416_____DROFFUNGUS
151____642603_____Ken_g6
192____388797_____kiska
370____36364______ChelseaOilman
409____7384_______MangoX

Rank__Credits____Team
1_____143709410___TeAm AnandTech
2_____119431333___Antarctic Crunchers
3_____50878682___Czech National Team
4_____42627481___SETI.Germany

Congrats on winning the very first primorial challenge!
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
Congratulations TeAm!

Hmm, when was the last time that there was so little opportunity to fine-tune before a PrimeGrid challenge?
(Actually, some opportunity was there, but at least as far as I am concerned, a lack of spare time got in the way of grasping it.)
 
Reactions: crashtech

TennesseeTony

Elite Member
Aug 2, 2003
4,233
3,665
136
Hmm, when was the last time that there was so little opportunity to fine-tune before a PrimeGrid challenge?

Using Montgomery reduction FMA3 FFT length 2x288K
Using Montgomery reduction FMA3 FFT length 2x384K
Using Montgomery reduction FMA3 FFT length 2x420K
Using Montgomery reduction FMA3 FFT length 2x480K
Using Montgomery reduction FMA3 FFT length 2x512K

Hard to keep up with the tuning.
 