Question CPUs for shared memory parallel computing

Hermetian · Sep 6, 2024

igor_kavinski said:
I can even let you access my 128 thread eight channel Epyc over Anydesk or any other remote software of your choice

No you can't, because you are lacking a Wolfram license, which would have to be the extended version for extra parallel kernels.

Two things are very clear about my situation, which I've previously stated in this discussion thread: (1) I will have to upgrade my hardware to tackle a larger dataset which is on the back burner for now (2) There's no currently available hardware in my price range that would significantly improve the runtime of that dataset.

In the meantime I'm improving the algorithm, which I'm discussing on the Wolfram community page linked to somewhere above.

Hermetian · Sep 6, 2024

Markfw said:
Frontier is the worlds fastest supercomputer.

There many different supercomputer architectures. Speed (e.g. petaflops) is a poor measure of any unless you have a very large embarrassingly parallel vector computation.

igor_kavinski · Sep 6, 2024

Hermetian said:
No you can't, because you are lacking a Wolfram license, which would have to be the extended version for extra parallel kernels.

My Epyc is currently configured for two cores per CCD without SMT, so that's 16 cores available. You can install your licensed Mathematica on my machine, test it and then uninstall Mathematica once you are done.

LightningZ71 · Sep 6, 2024

I will reierate my earlier suggestion: before you drop a bunch of money on this project, use tools like PerfMon to profile the behavior of your existing system to confirm where your program is experiencing bottlenecking in it's performance. Also, keep in mind that if you go wider with more cores, you will likely get to a point where your SSD becomes a bottleneck as many of the non-enterprise products don't tolerate high levels of random write requests very well. You may want to dig up some Intel Optane drives for the output of your calculations.

igor_kavinski · Sep 6, 2024

LightningZ71 said:
You may want to dig up some Intel Optane drives for the output of your calculations.

Even that may not be needed. If suppose out of 128GB RAM, there is maybe 50GB left free, he can create a virtual RAM drive using imdisk and use that for his storage needs. Once finished, just copy from RAM drive back to SSD for permanent storage of the results.

Hermetian · Sep 6, 2024

LightningZ71 said:
I will reierate my earlier suggestion: before you drop a bunch of money on this project, use tools like PerfMon to profile the behavior of your existing system to confirm where your program is experiencing bottlenecking in it's performance.

Done. I also located a built in Wolfram routine to replace the offending code.

Hermetian · Sep 6, 2024

igor_kavinski said:
You can install your licensed Mathematica on my machine ...

No I can't.
Further, I'm waiting for newer hardware offerings from manufacturers before benchmarking anything.

igor_kavinski said:
he can create a virtual RAM drive ...

Optimized SSD is sufficient.

igor_kavinski · Sep 6, 2024

Hermetian said:
Further, I'm waiting for newer hardware offerings from manufacturers before benchmarking anything.

Turin (Zen 5 Epyc) or Shimada Peak (Zen 5 Threadripper) ?

If you are waiting for anything from Intel side, their workstation platform's only benefit is AVX-512 and AMX instructions which Mathematica doesn't use or Arrow Lake (Core Ultra Series 2) which has E-cores. If it's Intel you prefer, I would recommend not going with their workstation CPUs since Core Ultra Series 2 will offer MUCH better performance, even with E-cores enabled.

Hermetian · Sep 6, 2024

igor_kavinski said:
If you are waiting for anything from Intel side ...

I came to this forum because I previously depended on AnandTech news briefs for up to date industry information. They closed shop last month.

I imagine the regulars on this forum are accustom to looky-loos asking for advice on a new system, and thus the responses I've received pointing at systems I could buy now. But I'm in no hurry.

Additionally, in the high-performance computing world there's a phenomenon we call "killing a poor algorithm with a big machine." I want no part of that. It is what the current trend of thread eaters for AI is all about.

Nothingness · Sep 6, 2024

Hermetian said:
Additionally, in the high-performance computing world there's a phenomenon we call "killing a poor algorithm with a big machine." I want no part of that. It is what the current trend of thread eaters for AI is all about.

This reminds me of the old saying "premature optimization is the root of all evil". It basically says the same: don't start tuning things until you've chosen the right algorithm. This ultimately applies to the choice of system and implementation language.

MS_AT · Sep 6, 2024

Nothingness said:
This reminds me of the old saying "premature optimization is the root of all evil". It basically says the same: don't start tuning things until you've chosen the right algorithm. This ultimately applies to the choice of system and implementation language.

Original quote with wider context:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Posting just for reference purposes

igor_kavinski · Sep 6, 2024

My computing purchases have always been "let's buy a bazooka to kill an annoying cockroach"

Speaking from personal experience because I spent YEARS with a Core i3-2100 souped up with 32GB RAM and a $300 Z77 mobo and pagefile turned OFF and yet did not feel the system was slow

My current 12700K with DDR5-7000 and Epyc 128 thread CPU with 128GB RAM could be called World Ending Mega Space Bazookas in comparison

Hermetian · Sep 6, 2024

There are N-body problems in astronomy for which a well-crafted algorithm on a hypercube is the most efficient known solution. And yet, some of the most challenging problems in petroleum hydrology fail there but do very well on systems like the Frontier. And so on. Finally, I'll mention stochastic optimization problems on energy landscapes for which quantum computers would be ideal.

LightningZ71 · Sep 6, 2024

Hermetian said:
Done. I also located a built in Wolfram routine to replace the offending code.

i'm personally curious as to your findings with PerfMon. Using it well is an art unto itself.

Hermetian · Sep 6, 2024

LightningZ71 said:
i'm personally curious as to your findings with PerfMon. Using it well is an art unto itself.

At the moment I'm driving a i7-1165G7 laptop with 4 cores, 16 GB RAM, and 12 MB cache. The non-parallel version of my code runs in turbo mode at about 4.6 GHz. When performing a fixed-length regex search on a String of length ~14 M bytes, I see 5 interrupts of about 0.015 seconds -- which I assume are cache related. The total run time of the search is 22 to 26 seconds, returning about 1.7 M "candidates" to be vetted. The vetting was taking 20-30 minutes. However, I discovered a way to use the built-in function SequenceAlignment and obtain the same results in 40 seconds.

Hermetian · Sep 6, 2024

@igor_kavinski
You are in time zone GMT+10 ?

igor_kavinski · Sep 6, 2024

Hermetian said:
@igor_kavinski
You are in time zone GMT+10 ?

GMT+4 : Abu Dhabi

Hermetian · Sep 6, 2024

igor_kavinski said:
GMT+4 : Abu Dhabi

GMT-7 : Vista CA

LightningZ71 · Sep 6, 2024

Hermetian said:
At the moment I'm driving a i7-1165G7 laptop with 4 cores, 16 GB RAM, and 12 MB cache. The non-parallel version of my code runs in turbo mode at about 4.6 GHz. When performing a fixed-length regex search on a String of length ~14 M bytes, I see 5 interrupts of about 0.015 seconds -- which I assume are cache related. The total run time of the search is 22 to 26 seconds, returning about 1.7 M "candidates" to be vetted. The vetting was taking 20-30 minutes. However, I discovered a way to use the built-in function SequenceAlignment and obtain the same results in 40 seconds.

Thank you for that. That speedup is fantastic! When you apply it to your 10 series processor running 9 instances, you will likely find other bottlenecks. The cacheing structure and amounts are notably different which could have a negative affect on the performance of SequenceAligbment. In addition, the data throughput of the whole program will increase dramatically, potentially causing a bottleneck in RAM or writing to the SSD.

Scaling these big compute projects is always an adventure.

Hermetian · Sep 6, 2024

LightningZ71 said:
When you apply it to your 10 series processor running 9 instances, you will likely find other bottlenecks.

Absolutely! 🙂

LightningZ71 said:
The caching structure and amounts are notably different which could have a negative affect on the performance of SequenceAlignment.

SequenceAlignment is used to compare the actual primer (String of 16 or so letters) with each of the candidates returned by the regular expression search. So I don't believe it will be affected that way.

LightningZ71 said:
In addition, the data throughput of the whole program will increase dramatically, potentially causing a bottleneck in RAM or writing to the SSD.

Yes, I've previously observed the bottleneck caused by independent kernels (processes) in different cores all trying to pipe data from memory through 2 channels. I'm looking forward to seeing how the application scales with the memory upgrade.

LightningZ71 said:
Scaling these big compute projects is always an adventure.

You bet. Sometimes changing the ordering of the nested iterations offers a surprise improvement.

LightningZ71 · Sep 6, 2024

On the 1165g7, using the SequenceAlignment tuned function, how many instances were you running to get the desired output in 40 seconds?

Hermetian · Sep 6, 2024

LightningZ71 said:
On the 1165g7, using the SequenceAlignment tuned function, how many instances were you running to get the desired output in 40 seconds?

Well, zero : I'm presently running in immediate mode.

Hermetian · Sep 7, 2024

For an example of functional programming in Wolfram, see "vettedCandidates" computation in the "FOURTH UPDATE" posted on this page: https://community.wolfram.com/groups/-/m/t/3269106

StefanR5R · Sep 7, 2024

cellarnoise said:
@StefanR5R knows science

Maybe, or maybe I don't; in any case I don't know a thing about genomics. At occasions at which I was confronted with regexes or bitmaps or graphs, it was only with tiny datasets. Also, I never used Mathematica.

From what was discussed so far, I for one am in the dark whether or not there are notable data dependencies between the program threads.

If yes, then maybe more last-level CPU cache could help; unified cache (Ryzen 7 7800X3D) perhaps more so than segmented cache (various CPU choices; but even segmented cache may help).
If no, then maybe more memory channels could help to scale the application up. (Obviously, up to 12 DDR5 channels are available right now for single-socket machines, up to 24 for dual-socket machines. Try on a rented VM before a buy, *if* it is possible to transfer the Mathematica license to a cloud VM.) But without much dependencies between threads, the real barrier to scaling up is probably software licensing cost.

But all this comes after finding out whether or not the algorithm can be improved, of course.

LightningZ71 · Sep 7, 2024

I kinda had a feel for whatcwas going on with the original approach, but with the new SA function in Wolfram, I really don't have a handle on it's system load profile. It seems to me that, with such a drastic seedup, he might be well served by a Tiger Lake-H 8 core processor on a decently cooled platform with the mitigations disabled to restore proper I/O performance. Those laptops and microdesktops are relatively cheap these days and have homogenous cores. 24MB of L3 isn't anything to sneeze at either. If he was seeing weeks under the old algorithm on the Comet Lake 10 core, he might be able to use the new technique overnight.

Question CPUs for shared memory parallel computing

Member

Member

Lifer

Golden Member

Lifer

Member

Member

Lifer

Member

Diamond Member

Senior member

Lifer

Member

Golden Member

Member

Member

Lifer

Member

Golden Member

Member

Golden Member

Member

Member

Elite Member

Golden Member