Question CPUs for shared memory parallel computing

Nothingness · Sep 8, 2024

igor_kavinski said:
I think the problem is Mathematica. I've read a bit about the guy (he's a rare genius) and Wolfram Alpha was pretty impressive back in the day when I tried it. However, I believe he thinks that the world doesn't appreciate him or his contributions enough so he's resorted to overcharging for his abilities and products. This would be enough for me to never touch Mathematica because I dislike any company/individual engaged in price gouging.

For a product similar to Mathematica, look at Maple license cost. Are they overcharging too?

For people who need that kind of product, it's worth every penny spent.

Hermetian · Sep 8, 2024

@igor_kavinski
S. Wolfram was never a genius, nor is E. Musk. Both of them though hired brilliant people to start and sustain their enterprise. I've assisted some true geniuses over my career, so I do know the difference. Their capabilities are startling.

igor_kavinski · Sep 8, 2024

Nothingness said:
For a product similar to Mathematica, look at Maple license cost. Are they overcharging too?

The personal license is $349 though I don't know if it's yearly or perpetual.

Nothingness · Sep 8, 2024

igor_kavinski said:
The personal license is $349 though I don't know if it's yearly or perpetual.

Mathematica perpetual personal license is $390.

Hermetian · Sep 8, 2024

Nothingness said:
Mathematica perpetual personal license is $390.

That is for the current version with the base amount of parallel kernels and includes updates for up to 1 year.

Hermetian · Sep 10, 2024

Full house.

igor_kavinski · Sep 10, 2024

You should at the very least do a "quick" Windows Memory Diagnostics test.

May take about an hour at most.

Better than pulling hair later in case of unexplained crashes or whatever.

After the test is done, you may wanna "tweak" the RAM timings a bit for lower CAS latency, even just one less than what the sticks are defaulting to, as the 10900K's memory controller is pretty good from what I've heard.

If Windows boots with that, do the quick memory diagnostics test again and then you can run your workload gladly knowing that your RAM is running faster than what it's rated at.

Hermetian · Sep 10, 2024

igor_kavinski said:
You should at the very least do a "quick" Windows Memory Diagnostics test.

That's standard procedure for a factory-authorized warranty installation.

But I didn't know about that BIOS setting - thanks for the pointer!

Also, someone here recommended turning on XMP so that's already completed.

Hitman928 · Sep 10, 2024

Hermetian said:
That's standard procedure for a factory-authorized warranty installation.

But I didn't know about that BIOS setting - thanks for the pointer!

Also, someone here recommended turning on XMP so that's already completed.

If you turned on XMP after the memory test was performed, you'll want to test the memory again as XMP essentially overclocks the memory.

Hermetian · Sep 10, 2024

Hitman928 said:
If you turned on XMP after the memory test was performed, you'll want to test the memory again as XMP essentially overclocks the memory.

Oh, the installers turned on XMP before running their install diagnostics. They also run full system diagnostics before their install.

igor_kavinski · Sep 10, 2024

Hermetian said:
But I didn't know about that BIOS setting - thanks for the pointer!

I don't know what your mobo model is but I don't think you can change CAS latency with XMP turned on. You will have to look at the CAS and subtimings that XMP is applying, change to manual timings and then apply one less.

Like suppose your kit's XMP timings are 16-22-22-22.

Change to manual and try running the kit at 15-21-21-21.
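To put that one-step change in perspective, here is a minimal sketch that converts a CAS value into nanoseconds. The kit's data rate is not stated in this thread, so the DDR4-3200 figure below is purely an assumption for illustration; the function name is also made up.

```python
def cas_latency_ns(cas_cycles: int, data_rate_mts: float) -> float:
    """First-word CAS latency in nanoseconds.

    DDR memory transfers twice per clock, so the memory clock in MHz is
    data_rate_mts / 2 and one clock cycle lasts 2000 / data_rate_mts ns.
    """
    return cas_cycles * 2000.0 / data_rate_mts


# ASSUMPTION: 3200 MT/s is a placeholder, not the poster's actual kit speed.
DATA_RATE = 3200

for cl in (16, 15):
    print(f"CL{cl} at {DATA_RATE} MT/s: {cas_latency_ns(cl, DATA_RATE):.3f} ns")
```

At the assumed data rate, going from CL16 to CL15 trims first-word latency by well under a nanosecond, so any gain for a given workload is best confirmed by measurement.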

If your mobo has a realtime memory timings setting that you can turn on, you can experiment in Windows itself with the timings by using Intel's XTU: https://www.intel.com/content/www/us/en/download/17881/intel-extreme-tuning-utility-intel-xtu.html

XTU's AI overclock may get you a quick 5% speed boost too.

Thunder 57 · Sep 10, 2024

@Hermetian

You have been so abrasive in your posts I am surprised anyone is still trying to help you. You come here asking for assistance and you have been an asshat about it.

Hermetian · Sep 10, 2024

Thunder 57 said:
You have been so abrasive in your posts I am surprised anyone is still trying to help you. You come here asking for assistance and you have been an asshat about it.

My perspective is that I've thanked folks here repeatedly for their input. I've also disagreed with some people and the discussion that followed was helpful.

Thunder 57 · Sep 10, 2024

Hermetian said:
My perspective is that I've thanked folks here repeatedly for their input. I've also disagreed with some people and the discussion that followed was helpful.

Maybe I read too much into your posts. Please accept my apologies.

Hermetian · Friday at 1:53 AM

I'm just about done rewriting the code. Previously it was a two program operation, first the regex scans with some basic filtering, then post processing with narrow filtering. I've rewritten the first part in such a way that the second is unnecessary. This week I also wrote an API plus some documentation. Overall, the run times on my "standard" datasets are down from 30+ minutes to 4-5 minutes. As someone here predicted, my week-long runs can now be executed overnight.

With regard to the memory upgrade, Windows 11 and my standard datasets sure look small at runtime! Windows appears to consume about 4% of RAM, and this program another 4% during execution. The upgrade has also permitted running more "kernels" (parallel processes). I'm up from 9 to 12, which is reflected in the 4.5 minute run above.

The "wannabe" datasets are twice as long in search length, which will about double the runtimes. The size will also increase memory load. By this time next week I hope to have some test data from them.

LightningZ71 · Friday at 9:56 AM

Those speedups are excellent! I'm very happy for you.

yottabit · Saturday at 10:22 PM

I know the “AMD Epyc” angle has been discussed to death, but I did want to point out there are system builders who could get you into one with 12-channel memory for not much over $5k.

The “F” SKUs are frequency optimized and have good cache; in this case the 24-core is actually $1000 cheaper than the 16-core for some reason.

You would lose some clock speed yes, but the performance per clock on these CPUs is quite high. And then you’d be gaining a massive amount of memory bandwidth.

For whatever reason, configuring this sort of custom Epyc build seems to be much better value than the equivalent Threadripper.

From what I understand, the low core count Threadrippers also cannot really take advantage of the full memory channels, whereas any Epyc CPU should be able to fully saturate all 12. In my opinion, it’s quite affordable for what you get. I did not add hard drives etc. yet. It’s only another $800 to upgrade to 12x 32 GB RAM for 384 GB total.

AMD EPYC Single Processor Series Workstation

Broadberry CyberStation powered by AMD EPYC 9004 processors can be configured with up to 3TB of RAM and 4x SATA drives.

www.broadberry.com

These CPUs are based on AMD Zen4 architecture. Zen5 is around the corner and may bring large uplifts. Maybe something to consider when you do need the really large dataset!

I have no experience with this system builder; I just wanted to provide an option. I know all this because the 24-core F CPU is one of the best values for CFD simulations, with its high frequencies, cache, and 2:1 ratio of cores to memory channels.

Edit: added link

Hermetian · Saturday at 10:48 PM

@yottabit
Thank you for the details. 🙂
4.8 GHz is tempting.
The # of cores and channels is also interesting, esp. since my current Mathematica license limits me to 12 parallel processes.
I'm not sure if my liquid cooler would fit the form factor of these "builders", so a current system might run about $7,000.

I agree with you that the future is promising.

StefanR5R · Sunday at 4:30 AM

@Hermetian, if you are willing to remove and later reseat DIMMs without assistance of your system builder, you could run scaling tests (throughput of 1, 2, 3, …, 10 concurrently running kernels) on your 2-channels × 2-DIMMs-per-channel computer once with only one memory channel populated, then with both channels populated. This would give more insight into how much memory bandwidth per kernel is desirable, hence whether or not it would make sense to move to a more expensive high channel count platform. (In this context, testing with 11 or 12 concurrently running kernels is not as interesting, as the i9-10900KF is a 10-core CPU.)
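A rough sketch of how the results of such a test could be tabulated, assuming you have recorded the elapsed time for each kernel count in both memory configurations. The function names are hypothetical and the numbers in the example are made-up placeholders, not measurements from this thread.

```python
def scaling_table(times_by_kernels: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (kernels, speedup, efficiency) rows relative to the 1-kernel run."""
    base = times_by_kernels[1]
    rows = []
    for n in sorted(times_by_kernels):
        speedup = base / times_by_kernels[n]
        rows.append((n, speedup, speedup / n))
    return rows


def channel_gain(one_channel: dict[int, float], two_channel: dict[int, float]) -> dict[int, float]:
    """Per kernel count, ratio of single-channel time to dual-channel time.

    A ratio near 1.0 suggests the run is not limited by memory bandwidth at
    that kernel count; a ratio well above 1.0 suggests bandwidth matters.
    """
    return {n: one_channel[n] / two_channel[n]
            for n in sorted(one_channel) if n in two_channel}


# PLACEHOLDER timings in minutes, invented only to show the calculation.
one_ch = {1: 100.0, 2: 52.0, 4: 30.0}
two_ch = {1: 100.0, 2: 51.0, 4: 27.0}

for n, speedup, eff in scaling_table(two_ch):
    print(f"{n} kernels: speedup {speedup:.2f}, efficiency {eff:.2f}")
print(channel_gain(one_ch, two_ch))
```

If the channel gain stays close to 1.0 all the way up to 10 kernels, a 12-channel platform is unlikely to pay off for this workload.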

Though even better than simply measuring performance in such scaling tests, actual profiling would give a lot more insight into potential bottlenecks. But not being a programmer myself, I have no suggestions as to which tools to look into for this purpose.

Regarding CPU coolers for high-memory-channel-count computers: Large tower coolers (that is, air coolers) are IMO always much preferable whenever they are available for the given socket and fit the given computer case. For socket SP5 (EPYC 9004 and the upcoming 9005) in particular, existing tower coolers aren't very large, but probably more than large enough for your purpose of utilizing no more than 12 cores.

Your current socket FCLGA1200 cooler won't fit on socket SP5.

Hermetian · Sunday at 2:06 PM

@StefanR5R
Thank you for the ideas and specs. 🙂

The combinatorics of this computation are:
(# of targets)x(# of queries)x(# of regex per query).
My current design parallelizes across the #regex.
I'm checking to see if this is still the best choice. In fact, I might need a hybrid algorithm that varies by (length of targets i.e. MB), #targets, and #queries.
The parallelism introduces additional latencies of kernel (process) startup and management, plus in my case the writing of independent results to SSD and then reading those files by the master process after the slaves have finished.
At present I'm examining how all the performance factors stack up.
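To make the trade-off between parallelization axes concrete, here is a small sketch in Python (not the Mathematica code itself). It enumerates the (target, query, regex) work items and assigns them to workers by one chosen axis. The count of 24 regexes per query is a made-up figure for illustration, and all names are hypothetical.

```python
from itertools import product

TARGET, QUERY, REGEX = 0, 1, 2  # positions inside each task tuple


def make_tasks(n_targets, n_queries, n_regex):
    """Enumerate every (target, query, regex) index combination."""
    return list(product(range(n_targets), range(n_queries), range(n_regex)))


def partition(tasks, axis, n_workers):
    """Assign tasks to workers round-robin by their index on the chosen axis."""
    buckets = [[] for _ in range(n_workers)]
    for task in tasks:
        buckets[task[axis] % n_workers].append(task)
    return buckets


def busy_workers(buckets):
    """Count workers that received at least one task."""
    return sum(1 for bucket in buckets if bucket)


# ASSUMPTION: 24 regexes per query is invented; 2 targets and 2 queries
# mirror the benchmark described in this thread.
tasks = make_tasks(n_targets=2, n_queries=2, n_regex=24)
for name, axis in (("target", TARGET), ("query", QUERY), ("regex", REGEX)):
    buckets = partition(tasks, axis, n_workers=12)
    print(name, busy_workers(buckets), max(len(b) for b in buckets))
```

With only two targets or two queries, splitting on those axes leaves 10 of 12 workers idle, which is one reason splitting on the regex axis can be the better choice for small target and query counts. A hybrid scheme would pick the axis (or a combination) based on the counts at hand.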

Yesterday I ran my first benchmark on a "standard" target length of 15MB with two targets and two queries. For benchmark purposes, I turn off Windows Update, the lock screen, etc. Here are the results. Today I'll run a broader test with target lengths ranging from 10kB to 10MB.

In the plot:
elapsed time = 1 corresponds to 2.36 hours.
ideal time = 1/#kernels.
decibel latency = 10 x (elapsed time - ideal time) x log10(#kernels).
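For anyone who wants to reproduce the plotted quantities, the three definitions above can be written out as a short Python sketch. The function names are mine; the 2.36-hour normalization and the formulas are taken directly from the definitions above.

```python
import math

BASELINE_HOURS = 2.36  # elapsed time = 1 corresponds to 2.36 hours


def normalized_elapsed(elapsed_hours: float) -> float:
    """Scale a wall-clock time in hours to the plot's units."""
    return elapsed_hours / BASELINE_HOURS


def ideal_time(kernels: int) -> float:
    """Perfect-scaling time in plot units: 1 / #kernels."""
    return 1.0 / kernels


def decibel_latency(elapsed: float, kernels: int) -> float:
    """10 x (elapsed time - ideal time) x log10(#kernels), with elapsed in plot units."""
    return 10.0 * (elapsed - ideal_time(kernels)) * math.log10(kernels)
```

Note that log10(1) = 0, so this metric is always zero for a single kernel regardless of the measured time.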

Ken g6 · Monday at 2:46 AM

If you invert the plot of elapsed time, to runs per hour or the like, and make the x-axis logarithmic, you might get both better resolution and an idea of how it's following Amdahl's law.

If it fits well that might give you an idea of how it would perform on more cores.
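As a minimal sketch of such a fit: with times normalized so that one kernel takes 1, Amdahl's law says T(n) = (1 - p) + p/n, where p is the parallel fraction. That can be rearranged to T(n) - 1 = p(1/n - 1), which gives a closed-form least-squares estimate of p. The function names and the example data are hypothetical; the data is generated from the formula itself, not measured.

```python
def amdahl_time(p: float, n: int) -> float:
    """Normalized run time on n kernels for parallel fraction p."""
    return (1.0 - p) + p / n


def fit_parallel_fraction(times: dict[int, float]) -> float:
    """Least-squares estimate of p from {kernels: normalized elapsed time}."""
    num = den = 0.0
    for n, t in times.items():
        x = 1.0 / n - 1.0          # zero for n = 1, so that point adds nothing
        num += (t - 1.0) * x
        den += x * x
    if den == 0.0:
        raise ValueError("need at least one measurement with more than 1 kernel")
    return num / den


# SYNTHETIC data generated with p = 0.9, only to show the fit recovers it.
synthetic = {n: amdahl_time(0.9, n) for n in (1, 2, 4, 8, 12)}
p = fit_parallel_fraction(synthetic)
print(f"fitted p = {p:.3f}")
print(f"predicted speedup on 24 kernels = {1.0 / amdahl_time(p, 24):.2f}")
print(f"upper bound on speedup = {1.0 / (1.0 - p):.1f}")
```

If the measured points sit well above the fitted curve at high kernel counts, something other than the serial fraction (startup overhead, file I/O, memory bandwidth) is likely contributing.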

Hermetian · Monday at 4:10 AM

For latest code release and a little documentation, see Fifth Update at

Quest for improved runtime in searches of primer sequences in whole DNA chromosomes - Online Technical Discussion Groups—Wolfram Community

Wolfram Community forum discussion about Quest for improved runtime in searches of primer sequences in whole DNA chromosomes. Stay on top of important topics and build connections by joining Wolfram Community groups relevant to your interests.

community.wolfram.com

Hermetian · Monday at 3:44 PM

Nothingness said:
For people who need that kind of product, it's worth every penny spent.

Lucasfilm is one of them.

Hermetian · Monday at 4:31 PM

@Ken g6
I have an extended Mathematica license permitting up to 12 parallel kernels. The next step up is out of my budget. Thus a 12 core system is about all I'll consider.

Here's some additional data from the test that produced the chart above. At 12 kernels I'm approaching the maximum processor load, but there is plenty of memory available (I currently have 128GB). This indicates that for the given problem size, I could parallelize at a higher level without impacting memory resources. For this reason I'm currently running performance tests across a wide range of problem sizes to determine where shifts in the parallelism paradigm might be warranted.

LightningZ71 · Sep 17, 2024

From your earlier post, it appears that your parallelism tends to hit a wall around 8 processes as scaling really falls off. I don't necessarily think that adding more cores or processes beyond that is going to help much. Profiling the runtime operation of the whole system is becoming more necessary at this point.
