Question CPUs for shared memory parallel computing


Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
I can even let you access my 128-thread eight-channel Epyc over AnyDesk or any other remote software of your choice
No you can't, because you are lacking a Wolfram license, which would have to be the extended version for extra parallel kernels.

Two things are very clear about my situation, which I've previously stated in this discussion thread: (1) I will have to upgrade my hardware to tackle a larger dataset, which is on the back burner for now; and (2) there's no currently available hardware in my price range that would significantly improve the runtime of that dataset.

In the meantime I'm improving the algorithm, which I'm discussing on the Wolfram community page linked to somewhere above.
 
Jul 27, 2020
19,613
13,476
146
No you can't, because you are lacking a Wolfram license, which would have to be the extended version for extra parallel kernels.
My Epyc is currently configured for two cores per CCD without SMT, so that's 16 cores available. You can install your licensed Mathematica on my machine, test it and then uninstall Mathematica once you are done.
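Once your Mathematica is on the box, a quick sanity check of how many parallel kernels the license actually gives you would look something like this (a minimal sketch; I don't have Wolfram myself, so treat it as untested):

$ProcessorCount                       (* hardware threads Mathematica can see *)
LaunchKernels[];                      (* start parallel kernels; the license caps how many *)
$KernelCount                          (* how many subkernels actually launched *)
ParallelEvaluate[$MachineName]        (* confirm they are all running on this host *)
CloseKernels[];                       (* shut them down when finished *)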
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
I will reiterate my earlier suggestion: before you drop a bunch of money on this project, use tools like PerfMon to profile the behavior of your existing system to confirm where your program is experiencing bottlenecks. Also, keep in mind that if you go wider with more cores, you will likely reach a point where your SSD becomes a bottleneck, as many of the non-enterprise products don't tolerate high levels of random write requests very well. You may want to dig up some Intel Optane drives for the output of your calculations.
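Even a crude timing pass inside Mathematica itself (not a substitute for PerfMon's disk and memory counters) would at least show which stage dominates. A minimal sketch, where searchStage[] and vetStage[] are placeholders standing in for the real steps:

{tSearch, searchOut} = AbsoluteTiming[searchStage[]];        (* stage 1: candidate search *)
{tVet, vetOut}       = AbsoluteTiming[vetStage[searchOut]];  (* stage 2: vetting *)
Print["search: ", tSearch, " s,  vetting: ", tVet, " s"];
Print["peak kernel memory: ", MaxMemoryUsed[]/2.^20, " MB"];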
 
Jul 27, 2020
19,613
13,476
146
You may want to dig up some Intel Optane drives for the output of your calculations.
Even that may not be needed. Suppose that out of 128GB of RAM there is maybe 50GB left free; he can create a virtual RAM drive using ImDisk and use that for his storage needs. Once finished, just copy from the RAM drive back to the SSD for permanent storage of the results.
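Something along these lines, assuming the ImDisk drive has already been created from an admin prompt (drive letter and size are just examples, e.g. imdisk -a -s 40G -m R: -p "/fs:ntfs /q /y"):

SetDirectory["R:\\"];                                             (* work on the RAM drive *)
Export["candidates.csv", results];                                (* fast, RAM-backed writes; 'results' is a placeholder *)
CopyFile["R:\\candidates.csv", "C:\\results\\candidates.csv"]     (* persist to the SSD afterwards *)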
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
I will reiterate my earlier suggestion: before you drop a bunch of money on this project, use tools like PerfMon to profile the behavior of your existing system to confirm where your program is experiencing bottlenecks.
Done. I also located a built-in Wolfram routine to replace the offending code.
 
Jul 27, 2020
19,613
13,476
146
Further, I'm waiting for newer hardware offerings from manufacturers before benchmarking anything.
Turin (Zen 5 Epyc) or Shimada Peak (Zen 5 Threadripper)?

If you are waiting for anything from the Intel side, the choices are their workstation platform, whose only benefit is AVX-512 and AMX instructions (which Mathematica doesn't use), or Arrow Lake (Core Ultra Series 2), which has E-cores. If it's Intel you prefer, I would recommend not going with their workstation CPUs, since Core Ultra Series 2 will offer MUCH better performance, even with E-cores enabled.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
If you are waiting for anything from the Intel side ...
I came to this forum because I previously depended on AnandTech news briefs for up-to-date industry information. They closed shop last month.

I imagine the regulars on this forum are accustomed to looky-loos asking for advice on a new system, hence the responses I've received pointing at systems I could buy now. But I'm in no hurry.

Additionally, in the high-performance computing world there's a phenomenon we call "killing a poor algorithm with a big machine." I want no part of that. It is what the current trend of thread eaters for AI is all about.
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,971
136
Additionally, in the high-performance computing world there's a phenomenon we call "killing a poor algorithm with a big machine." I want no part of that. It is what the current trend of thread eaters for AI is all about.
This reminds me of the old saying "premature optimization is the root of all evil". It basically says the same: don't start tuning things until you've chosen the right algorithm. This ultimately applies to the choice of system and implementation language.
 
Reactions: Hermetian

MS_AT

Senior member
Jul 15, 2024
202
467
96
This reminds me of the old saying "premature optimization is the root of all evil". It basically says the same: don't start tuning things until you've chosen the right algorithm. This ultimately applies to the choice of system and implementation language.
Original quote with wider context:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Posting just for reference purposes
 
Jul 27, 2020
19,613
13,476
146
My computing purchases have always been "let's buy a bazooka to kill an annoying cockroach".

Speaking from personal experience: I spent YEARS with a Core i3-2100 souped up with 32GB RAM and a $300 Z77 mobo, pagefile turned OFF, and yet did not feel the system was slow.

My current 12700K with DDR5-7000 and 128-thread Epyc with 128GB RAM could be called World Ending Mega Space Bazookas in comparison.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
There are N-body problems in astronomy for which a well-crafted algorithm on a hypercube is the most efficient known solution. And yet, some of the most challenging problems in petroleum hydrology fail there but do very well on systems like the Frontier. And so on. Finally, I'll mention stochastic optimization problems on energy landscapes for which quantum computers would be ideal.
 
Reactions: Nothingness

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
I'm personally curious as to your findings with PerfMon. Using it well is an art unto itself.
At the moment I'm driving an i7-1165G7 laptop with 4 cores, 16 GB RAM, and 12 MB cache. The non-parallel version of my code runs in turbo mode at about 4.6 GHz. When performing a fixed-length regex search on a String of length ~14 M bytes, I see 5 interrupts of about 0.015 seconds -- which I assume are cache-related. The total run time of the search is 22 to 26 seconds, returning about 1.7 M "candidates" to be vetted. The vetting was taking 20-30 minutes. However, I discovered a way to use the built-in function SequenceAlignment and obtain the same results in 40 seconds.
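For the curious, the shape of the two stages is roughly this (the primer, pattern, and file name below are placeholders, and the real regex is more specific than shown):

seq    = Import["reads.txt", "String"];        (* ~14 MB of sequence text *)
primer = "ACGTACGTACGTACGT";                   (* stand-in for the ~16-letter primer *)

(* Stage 1: fixed-length regex scan for candidate substrings *)
candidates = StringCases[seq,
   RegularExpression["[ACGT]{" <> ToString[StringLength[primer]] <> "}"],
   Overlaps -> True];

(* Stage 2: vet each candidate against the primer with the built-in SequenceAlignment *)
alignments = SequenceAlignment[primer, #] & /@ candidates;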
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
At the moment I'm driving an i7-1165G7 laptop with 4 cores, 16 GB RAM, and 12 MB cache. The non-parallel version of my code runs in turbo mode at about 4.6 GHz. When performing a fixed-length regex search on a String of length ~14 M bytes, I see 5 interrupts of about 0.015 seconds -- which I assume are cache-related. The total run time of the search is 22 to 26 seconds, returning about 1.7 M "candidates" to be vetted. The vetting was taking 20-30 minutes. However, I discovered a way to use the built-in function SequenceAlignment and obtain the same results in 40 seconds.
Thank you for that. That speedup is fantastic! When you apply it to your 10 series processor running 9 instances, you will likely find other bottlenecks. The caching structure and amounts are notably different, which could have a negative effect on the performance of SequenceAlignment. In addition, the data throughput of the whole program will increase dramatically, potentially causing a bottleneck in RAM or writing to the SSD.

Scaling these big compute projects is always an adventure.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
When you apply it to your 10 series processor running 9 instances, you will likely find other bottlenecks.
Absolutely! 🙂
The caching structure and amounts are notably different, which could have a negative effect on the performance of SequenceAlignment.
SequenceAlignment is used to compare the actual primer (String of 16 or so letters) with each of the candidates returned by the regular expression search. So I don't believe it will be affected that way.
In addition, the data throughput of the whole program will increase dramatically, potentially causing a bottleneck in RAM or writing to the SSD.
Yes, I've previously observed the bottleneck caused by independent kernels (processes) in different cores all trying to pipe data from memory through 2 channels. I'm looking forward to seeing how the application scales with the memory upgrade.
Scaling these big compute projects is always an adventure.
You bet. Sometimes changing the ordering of the nested iterations offers a surprise improvement.
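The plan for the parallel runs is basically to split the candidate list across kernels; a rough sketch (kernel count and grain size are guesses until the upgrade is in place):

LaunchKernels[8];                                (* placeholder kernel count *)
DistributeDefinitions[primer];                   (* push the primer to every subkernel *)
alignments = ParallelMap[
   SequenceAlignment[primer, #] &,
   candidates,
   Method -> "CoarsestGrained"];                 (* few large batches: less main-kernel/subkernel traffic *)
CloseKernels[];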
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
On the 1165G7, using the tuned SequenceAlignment function, how many instances were you running to get the desired output in 40 seconds?
 

StefanR5R

Elite Member
Dec 10, 2016
5,888
8,757
136
@StefanR5R knows science
Maybe I do, or maybe I don't; in any case I don't know a thing about genomics. On the occasions when I was confronted with regexes or bitmaps or graphs, it was only with tiny datasets. Also, I have never used Mathematica.

From what has been discussed so far, I for one am in the dark as to whether or not there are notable data dependencies between the program threads.
  • If yes, then maybe more last-level CPU cache could help; unified cache (Ryzen 7 7800X3D) perhaps more so than segmented cache (various CPU choices; but even segmented cache may help).
  • If no, then maybe more memory channels could help to scale the application up. (Obviously, up to 12 DDR5 channels are available right now for single-socket machines, up to 24 for dual-socket machines. Try on a rented VM before you buy, *if* it is possible to transfer the Mathematica license to a cloud VM.) But without many dependencies between threads, the real barrier to scaling up is probably software licensing cost.
But all this comes after finding out whether or not the algorithm can be improved, of course.
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
I kinda had a feel for what was going on with the original approach, but with the new SA function in Wolfram, I really don't have a handle on its system load profile. It seems to me that, with such a drastic speedup, he might be well served by a Tiger Lake-H 8-core processor on a decently cooled platform with the mitigations disabled to restore proper I/O performance. Those laptops and microdesktops are relatively cheap these days and have homogeneous cores. 24MB of L3 isn't anything to sneeze at either. If he was seeing weeks under the old algorithm on the Comet Lake 10-core, he might be able to use the new technique overnight.
 
Reactions: Hermetian