Question CPUs for shared memory parallel computing


Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
I think the problem is Mathematica. I've read a bit about the guy (he's a rare genius) and Wolfram Alpha was pretty impressive back in the day when I tried it. However, I believe he thinks that the world doesn't appreciate him or his contributions enough so he's resorted to overcharging for his abilities and products. This would be enough for me to never touch Mathematica because I dislike any company/individual engaged in price gouging.
For a product similar to Mathematica, look at Maple's license cost. Are they overcharging too?

For people who need these kinds of products, they're worth every penny spent.
 
Jul 27, 2020
20,917
14,491
146
You should at the very least do a "quick" Windows Memory Diagnostics test.

May take about an hour at most.

Better than pulling hair later in case of unexplained crashes or whatever.

After the test is done, you may wanna "tweak" the RAM timings a bit, for lower CAS latency, even just one less than what the sticks are defaulting to, as the 10900K's memory controller is pretty good from what I've heard.

If Windows boots with that, do the quick memory diagnostics test again and then you can run your workload gladly knowing that your RAM is running faster than what it's rated at.
 
Reactions: Hermitian

Hitman928

Diamond Member
Apr 15, 2012
6,391
11,392
136
That's standard procedure for a factory-authorized warranty installation.

But I didn't know about that BIOS setting - thanks for the pointer!

Also, someone here recommended turning on XMP so that's already completed.

If you turned on XMP after the memory test was performed, you'll want to test the memory again as XMP essentially overclocks the memory.
 
Jul 27, 2020
20,917
14,491
146
But I didn't know about that BIOS setting - thanks for the pointer!
I don't know what your mobo model is but I don't think you can change CAS latency with XMP turned on. You will have to look at the CAS and subtimings that XMP is applying, change to manual timings and then apply one less.

Like suppose your kit's XMP timings are 16-22-22-22.

Change to manual and try running the kit at 15-21-21-21.
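To put a number on what one CAS step buys: first-word latency in nanoseconds is roughly CL × 2000 / data rate in MT/s, because the memory clock runs at half the transfer rate. A small Python sketch follows. The DDR4-3200 data rate is only an assumption for illustration, since the kit's actual speed isn't stated here.

```python
def cas_latency_ns(cas_cycles, data_rate_mts):
    """Approximate CAS latency in nanoseconds.

    cas_cycles    -- the first number of the timing set (e.g. 16 in 16-22-22-22)
    data_rate_mts -- memory data rate in MT/s (assumed value, e.g. 3200)
    """
    # Clock frequency in MHz is half the data rate; one cycle lasts 1000/MHz ns.
    return cas_cycles * 2000.0 / data_rate_mts


# Hypothetical DDR4-3200 kit, comparing the two timing sets from the example.
before = cas_latency_ns(16, 3200)  # 10.0 ns
after = cas_latency_ns(15, 3200)   # 9.375 ns
```

So under that assumption, dropping CAS by one saves well under a nanosecond per access, which is why it needs a re-test for stability but shouldn't be expected to transform run times.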

If your mobo has a realtime memory timings setting that you can turn on, you can experiment in Windows itself with the timings by using Intel's XTU: https://www.intel.com/content/www/us/en/download/17881/intel-extreme-tuning-utility-intel-xtu.html

XTU's AI overclock may get you a quick 5% speed boost too.
 
Reactions: Hermitian

Thunder 57

Diamond Member
Aug 19, 2007
3,079
4,873
136
@Hermitian

You have been so abrasive in your posts I am surprised anyone is still trying to help you. You come here asking for assistance and you have been an asshat about it.
 
Reactions: Markfw

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
I'm just about done rewriting the code. Previously it was a two-program operation: first the regex scans with some basic filtering, then post-processing with narrow filtering. I've rewritten the first part in such a way that the second is unnecessary. This week I also wrote an API plus some documentation. Overall, the run times on my "standard" datasets are down from 30+ minutes to 4-5 minutes. As someone here predicted, my week-long runs can now be executed overnight.

With regard to the memory upgrade, Windows 11 and my standard datasets sure look small at runtime! Windows appears to consume about 4% of RAM, and this program another 4% during execution. The upgrade has also permitted running more "kernels" (parallel processes). I'm up from 9 to 12, which is reflected in the 4.5 minute run above.

The "wannabe" datasets are twice as long in search length, which will about double the runtimes. The size will also increase memory load. By this time next week I hope to have some test data from them.
 

yottabit

Golden Member
Jun 5, 2008
1,527
592
146
I know the “AMD Epyc” angle has been somewhat discussed to death, but I did want to point out there are system builders who could get you into one with 12-channel memory for not much over $5k.

The “F” SKUs are frequency-optimized and have good cache; in this case the 24-core is actually $1000 cheaper than the 16-core for some reason.

You would lose some clock speed yes, but the performance per clock on these CPUs is quite high. And then you’d be gaining a massive amount of memory bandwidth.

For whatever reason, configuring this sort of custom Epyc build seems to be much better value than the equivalent Threadripper.

From what I understand, the low core count Threadrippers also cannot really take advantage of all their memory channels, whereas any Epyc CPU should be able to fully saturate all 12. In my opinion, it’s quite affordable for what you get. I did not add hard drives etc. yet. It’s only another $800 to upgrade to 12x 32 GB RAM for 384 GB total.


These CPUs are based on AMD's Zen 4 architecture. Zen 5 is around the corner and may bring large uplifts. Maybe something to consider when you do need the really large dataset!

I have no experience with this system builder; I just wanted to provide an option. I know all this because the 24-core F CPU is one of the best values for CFD simulations, with its high frequency, large cache, and 2:1 ratio of cores to memory channels.
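For a feel of what the channel count means, theoretical peak bandwidth is channels × data rate × 8 bytes per transfer. A rough Python sketch follows. The data rates (DDR5-4800 on the 12-channel platform, DDR4-2933 on a dual-channel desktop) are assumptions for illustration, not figures from the quoted configuration, and real sustained bandwidth will be lower than the peak.

```python
def peak_bandwidth_gbs(channels, data_rate_mts, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (64-bit channel = 8 bytes)."""
    return channels * data_rate_mts * bytes_per_transfer / 1000.0


def bandwidth_per_core_gbs(channels, data_rate_mts, cores):
    """Peak bandwidth divided evenly across the cores that are busy."""
    return peak_bandwidth_gbs(channels, data_rate_mts) / cores


# Assumed data rates, for comparison only.
epyc_peak = peak_bandwidth_gbs(12, 4800)          # 12-channel DDR5-4800
desktop_peak = peak_bandwidth_gbs(2, 2933)        # dual-channel DDR4-2933
epyc_per_core = bandwidth_per_core_gbs(12, 4800, 24)  # 24 cores, 2:1 ratio
```

Whether that extra bandwidth matters depends entirely on whether the workload is actually memory-bound.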

Edit: added link
 
Reactions: Tlh97 and Hermitian

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
@yottabit
Thank you for the details. 🙂
4.8 GHz is tempting.
The # of cores and channels is also interesting, esp. since my current Mathematica license limits me to 12 parallel processes.
I'm not sure if my liquid cooler would fit the form factor of these "builders", so a current system might run about $7,000.

I agree with you that the future is promising.
 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
@Hermitian, if you are willing to remove and later reseat DIMMs without assistance of your system builder, you could run scaling tests (throughput of 1, 2, 3, …, 10 concurrently running kernels) on your 2-channels × 2-DIMMs-per-channel computer once with only one memory channel populated, then with both channels populated. This would give more insight into how much memory bandwidth per kernel is desirable, hence whether or not it would make sense to move to a more expensive high channel count platform. (In this context, testing with 11 or 12 concurrently running kernels is not as interesting, as the i9-10900KF is a 10-core CPU.)
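One way to tabulate such a scaling test, as a minimal Python sketch: the function names and the input layout (a dict of kernel count to elapsed time) are made up for illustration, and the measurements have to come from your own runs.

```python
def scaling_table(elapsed_by_kernels):
    """Return (kernels, speedup, efficiency) rows relative to the 1-kernel run.

    elapsed_by_kernels -- dict mapping kernel count to measured elapsed time;
                          must contain an entry for 1 kernel.
    """
    base = elapsed_by_kernels[1]
    rows = []
    for n in sorted(elapsed_by_kernels):
        speedup = base / elapsed_by_kernels[n]
        rows.append((n, speedup, speedup / n))  # efficiency = speedup per kernel
    return rows


def compare_channels(single_channel, dual_channel):
    """Per kernel count, how many times slower the single-channel run was."""
    return {
        n: single_channel[n] / dual_channel[n]
        for n in sorted(single_channel)
        if n in dual_channel
    }
```

If the single-channel/dual-channel ratio stays near 1 at every kernel count, memory bandwidth is probably not the limit. If it grows as kernels are added, more channels would likely help.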

Though even better than simply measuring performance in such scaling tests, actual profiling would give a lot more insight into potential bottlenecks. But not being a programmer myself, I have no suggestions as to which tools to look into for this purpose.

Regarding CPU coolers for high-memory-channel-count computers: Large tower coolers (that is, air coolers) are IMO always much preferable whenever they are available for the given socket and fit the given computer case. For socket SP5 (EPYC 9004 and the upcoming 9005) in particular, existing tower coolers aren't very large, but probably more than large enough for your purpose of utilizing no more than 12 cores.

Your current socket FCLGA1200 cooler won't fit on socket SP5.
 
Reactions: Hermitian

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
@StefanR5R
Thank you for the ideas and specs. 🙂

The combinatorics of this computation are:
(# of targets)x(# of queries)x(# of regex per query).
My current design parallelizes across the #regex.
I'm checking to see if this is still the best choice. In fact, I might need a hybrid algorithm that varies by (length of targets i.e. MB), #targets, and #queries.
The parallelism introduces additional latencies of kernel (process) startup and management, plus in my case the writing of independent results to SSD and then reading those files by the master process after the slaves have finished.
At present I'm examining how all the performance factors stack up.
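To make the three axes concrete, here is a small Python sketch (not the actual Mathematica code, which isn't shown in this thread). It enumerates every (target, query, regex) work unit and deals them out to workers along whichever axis is chosen. All names and the dict-based input layout are hypothetical.

```python
import itertools
import re


def work_units(targets, queries):
    """Yield one (target_name, query_name, pattern) tuple per regex scan.

    targets -- dict mapping target name to its text
    queries -- dict mapping query name to a list of regex pattern strings
    """
    for t_name, q_name in itertools.product(targets, queries):
        for pattern in queries[q_name]:
            yield t_name, q_name, pattern


def partition(units, axis, n_workers):
    """Deal units into n_workers buckets, keeping units that share the same
    value on the chosen axis together (0 = target, 1 = query, 2 = regex)."""
    buckets = [[] for _ in range(n_workers)]
    assignment = {}
    for unit in units:
        key = unit[axis]
        if key not in assignment:
            assignment[key] = len(assignment) % n_workers  # round-robin
        buckets[assignment[key]].append(unit)
    return buckets


def scan(targets, bucket):
    """Run every scan in one bucket; returns (target, query, pattern, hits)."""
    return [(t, q, p, len(re.findall(p, targets[t]))) for t, q, p in bucket]
```

Each bucket would go to one kernel. Switching `axis` is the cheap way to compare parallelizing over regexes against parallelizing over targets or queries before committing to a hybrid scheme.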

Yesterday I ran my first benchmark on a "standard" target length of 15 MB with two targets and two queries. For benchmark purposes, I turn off Windows Update, the lock screen, etc. Here are the results. Today I'll run a broader test with target lengths ranging from 10 kB to 10 MB.

In the plot:
elapsed time = 1 corresponds to 2.36 hours.
ideal time = 1/#kernels.
decibel latency = 10 x (elapsed time - ideal time) x log10(#kernels).
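For anyone who wants to reproduce the plotted quantities, the definitions above translate directly into a few lines of Python. This is a sketch of the formulas as written, with hypothetical function names. It assumes elapsed times are first normalized by the 2.36-hour reference.

```python
import math

REFERENCE_HOURS = 2.36  # elapsed time = 1 in the plot


def normalized_elapsed(hours):
    """Convert a measured run time in hours to the plot's units."""
    return hours / REFERENCE_HOURS


def ideal_time(kernels):
    """Ideal time = 1/#kernels, i.e. perfect linear scaling."""
    return 1.0 / kernels


def decibel_latency(elapsed, kernels):
    """10 x (elapsed time - ideal time) x log10(#kernels), elapsed normalized."""
    return 10.0 * (elapsed - ideal_time(kernels)) * math.log10(kernels)
```

Note that with this definition the value is always 0 at one kernel, because log10(1) = 0.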

 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
If you invert the plot of elapsed time, to runs per hour or the like, and make the x-axis logarithmic, you might get both better resolution and an idea of how it's following Amdahl's law.



If it fits well that might give you an idea of how it would perform on more cores.
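A minimal Python sketch of such a fit, assuming elapsed times are normalized so the 1-kernel run equals 1. Amdahl's law then reads T(n) = s + (1 - s)/n, with s the serial fraction, and s can be estimated by least squares. The function names are made up for illustration.

```python
def fit_serial_fraction(times):
    """Least-squares estimate of the serial fraction s in T(n) = s + (1 - s)/n.

    times -- dict mapping kernel count n to normalized elapsed time T(n).
    Rearranged: T(n) - 1/n = s * (1 - 1/n), a line through the origin.
    """
    num = 0.0
    den = 0.0
    for n, t in times.items():
        x = 1.0 - 1.0 / n
        y = t - 1.0 / n
        num += x * y
        den += x * x
    if den == 0.0:
        raise ValueError("need at least one measurement with more than 1 kernel")
    return num / den


def predicted_time(serial_fraction, kernels):
    """Normalized elapsed time predicted by Amdahl's law."""
    return serial_fraction + (1.0 - serial_fraction) / kernels


def max_speedup(serial_fraction):
    """Upper bound on speedup as the kernel count grows without limit."""
    return 1.0 / serial_fraction
```

With a fitted s, `predicted_time(s, 24)` gives a rough idea of what a higher core count could deliver, provided nothing else (memory bandwidth, disk I/O) becomes the bottleneck first.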
 
Reactions: Hermitian

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
@Ken g6
I have an extended Mathematica license permitting up to 12 parallel kernels. The next step up is out of my budget. Thus a 12 core system is about all I'll consider.

Here's some additional data from the test that produced the chart above. At 12 kernels I'm approaching the maximum processor load, but there is plenty of memory available (I currently have 128GB). This indicates that for the given problem size, I could parallelize at a higher level without impacting memory resources. For this reason I'm currently running performance tests across a wide range of problem sizes to determine where a shift in parallelism paradigm might be warranted.

 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
From your earlier post, it appears that your parallelism tends to hit a wall around 8 processes as scaling really falls off. I don't necessarily think that adding more cores or processes beyond that is going to help much. Profiling the runtime operation of the whole system is becoming more necessary at this point.
 
Reactions: Hermitian