Question CPUs for shared memory parallel computing

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
I'm a computational scientist who writes shared memory parallel programs with Mathematica in the MS Windows environment. I'd like to keep up with hardware trends that could improve my computing platform.

At present I have a liquid-cooled i9-10900KF processor with 16GB RAM and 1TB SSD. The cores have been synchronized and overclocked (with Intel's temperature control) to about 5 GHz. The programs I write run anywhere from a few minutes to a few weeks. The longer runs are typically at 70% to 80% total CPU and Memory load, with the core temperatures in the low 80°C range.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
Very simple: AMD 9950X. It has full AVX-512, and with your liquid cooling solution 5.7-6.0 GHz is possible IMO. @Det0x, sound right to you?

Use DDR5-6000 CL30 or lower-latency RAM. You can reuse your SSD, but I am sure it's not PCIe 4.0. If your process relies on a lot of I/O, change the SSD.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
Do you know if Mathematica uses AVX-512? If so, then I think an AMD CPU might be a good choice.
 

Markfw

Even if it doesn't, it will still be light years faster than a 10900KF, and it won't self-destruct in a month like a 14900K/KF/KS would. Not sure about the GHz speed for sustained load with water cooling, but Det0x can answer that.
 

Hermitian

Very simple: AMD 9950X. It has full AVX-512, and with your liquid cooling solution 5.7-6.0 GHz is possible IMO. @Det0x, sound right to you?

Use DDR5-6000 CL30 or lower-latency RAM. You can reuse your SSD, but I am sure it's not PCIe 4.0. If your process relies on a lot of I/O, change the SSD.
It's an interesting CPU - with 16 cores, 5.7 GHz max clock, 192 GB max RAM, and a disappointing 2 memory channels (same as current).

It has the potential to cut the total run time of most of my parallel programs to 9/14ths of what it is now. I also have a class of problems with larger memory requirements per core. At present they will only "fit" in 6 cores due to my current 16GB installed RAM. I'm at that well-known junction of either upgrading RAM or upgrading the entire system.
 

Hermitian

RAM only gets you so far, but it's cheaper than a rebuild. Moving to AMD will unleash the potential and free up time. Besides that, you get to skip the Intel hybrid cores, which would likely be an issue. DDR5 should speed things up as well. You can get 32GB for under $100 easily.
Thanks. I'm not a fan of the hybrid cores either.
 

Markfw

In terms of CPU clock: Light years, no. Twice as fast, no. 5.7/5.0, yes.

In terms of total run time of my parallel computations under 30MB input data size per core: 14/9ths faster.
Most math-intensive programs use some type of special instruction set: SSE4, AVX, AVX-512, etc. I was assuming it used one of those, hopefully AVX-512 (the fastest of those instruction sets).

For math-intensive programs in the DC world, we use AMD a lot of the time. If you want more than 16 cores, more than 2 memory channels, and a lot of RAM, consider a Genoa system. It is the only one that is relatively affordable: 64 cores, 128 threads, 192 GB of RAM (what I chose so as not to break the bank), 12 memory channels of DDR5-4800, and 3.5 GHz all-core minimum. THAT will be light years faster, but it is about $4,500 for the whole thing with a retail CPU. If you can find an ES chip, take $1,400 off that figure.
 
Reactions: Tlh97

maddie

Diamond Member
Jul 18, 2010
4,932
5,075
136
Without going full server, the upcoming Strix Halo with 16 Zen 5 cores and a 256-bit memory bus might be an interesting choice. The problem is, it's months from release.
 
Reactions: Tlh97 and Hermitian

Schmide

Diamond Member
Mar 7, 2002
5,596
733
126
My point being: if your algorithm has a ton of dependencies and is constantly syncing memory, a faster processor is still going to be memory bottlenecked. The 70% CPU utilization kind of hints at that. 16GB of memory isn't a lot these days. 32-64GB is relatively cheap, the trade-off being that 64GB kits often have looser timings and 32GB kits faster ones. Some algorithms may like faster memory, others more memory.

If the algorithm has a moderate amount of parallelism, GPU compute will win over a CPU more often than not. The Wolfram Engine supports CUDA and OpenCL, though they will probably make you pay for it.

SSDs, though not as cheap as they were a few months ago, can go a long way toward alleviating large dataset thrashing. They are starting to make SLC SSDs that have good longevity for compute tasks, but those are still server only. Still, a high-speed dedicated SSD with large command queues could go a long way. Remember that more bits per cell means less longevity for these tasks: SLC (single-level cell) stores 1 bit per cell, MLC (multi-level cell) 2 bits, TLC 3, QLC 4, etc.

As always evaluate your algorithm and clean up your darn headers before impulse buying more compute.
 

Markfw

Well, if you want more than 2 memory channels, your choices are very limited. If you do not want to consider server CPU setups, then your choices are even more limited. You are not going to find any current CPU with more sustained speed than a 9950X that doesn't degrade, and certainly none that can maintain that speed 24/7 for weeks. That CPU has a lot of internal improvements. You can read the Zen 5 technical thread for more information. As for memory: DDR5-6000 at 1T is "sort of" quad channel, since from what I read each DDR5 DIMM is itself dual channel, and you have 2 channels of that. DDR5-6000 is already more than twice the speed of DDR4-2933, and 48 GB of DDR5-6000 CL30 is $219.
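
For anyone who wants to check the memory numbers, here is a rough sketch of the bandwidth arithmetic. It assumes the usual 64-bit (8-byte) data path per memory channel; the function name is made up for illustration, and the results are theoretical peaks, not measured or sustained figures.

```python
# Theoretical peak DRAM bandwidth = transfers per second x bytes per transfer x channels.
# Assumption: each memory channel is 64 bits (8 bytes) wide.
# Real sustained bandwidth is lower; treat these as upper bounds only.

def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int,
                       bytes_per_transfer: int = 8) -> float:
    """Return the theoretical peak bandwidth in GB/s (hypothetical helper)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

configs = {
    "DDR4-2933, 2 channels": peak_bandwidth_gbs(2933, 2),    # the current 10900KF platform
    "DDR5-6000, 2 channels": peak_bandwidth_gbs(6000, 2),    # the suggested 9950X setup
    "DDR5-4800, 12 channels": peak_bandwidth_gbs(4800, 12),  # the Genoa option
}

for name, gbs in configs.items():
    print(f"{name}: {gbs:.1f} GB/s")

# Ratio behind the "more than twice the speed" remark.
ratio = configs["DDR5-6000, 2 channels"] / configs["DDR4-2933, 2 channels"]
print(f"DDR5-6000 vs DDR4-2933: {ratio:.2f}x")
```

Whether a given program sees anything close to these ratios depends on how memory bound it actually is.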

I have provided the best options that I can think of, so good luck, I am done here.
 
Reactions: Tlh97 and Hermitian

Hermitian

My point being. ...

What you've said is true. Keep in mind I have hundreds of algorithms, using only a few in any particular program.

The 70% to 80% load is due to my choice in divide and conquer parallelism and the number of cores I've assigned for parallelism. I typically use 9 and leave one for the operating system. As mentioned above, there are some computations I'd like to do, but no matter how I slice it they will only "fit" on 6 cores. Your advice to upgrade memory is very good for the medium term; I got a quote for 48GB of DDR4-2933 installed for $138. However, my "wanna be" computation would only drop from 144 to 96 days. Oh well. 35 years ago my thesis computation ran for 3 days on a Cray XMP. It now runs in twenty seconds on my cell phone.
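
The run time figures in this thread follow from simple ideal scaling, sketched below. The assumptions are mine: run time is taken as inversely proportional to the number of kernels (no memory or I/O bottleneck), and the 14/9 figure quoted earlier is read as going from 9 to 14 kernels. The function name is hypothetical.

```python
from fractions import Fraction

def scaled_runtime(runtime, kernels_now: int, kernels_new: int):
    """Ideal scaling: run time is inversely proportional to the kernel count."""
    return runtime * Fraction(kernels_now, kernels_new)

# The large-memory computation: 6 kernels today, 9 once more RAM is installed.
days = scaled_runtime(144, 6, 9)

# Typical programs: 9 kernels today, 14 on a larger CPU (assumed reading of "14/9ths").
fraction_of_current = scaled_runtime(1, 9, 14)

print(days, fraction_of_current)
```

Real programs rarely scale this cleanly, so these are best-case estimates.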

I've no interest in GPU computing, regarding it as a distributed processing paradigm. It's ironic because 30 years ago I was a member of the MPI forum. But so was Intel, and they mined all the use cases -- putting them into shared memory hardware. Nowadays it's commonplace among manufacturers.

As a researcher whose office is at home, parallel computing on a single device is the most efficient model. I look forward to hearing about new CPUs here.
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
The input data size per core is under 30MB? What is the active memory footprint for each process at any given moment? Not the entire dataset size, but how much memory is it juggling at any given moment?

The reason I ask is that the 7800X3D has 96MB of L3 cache and an additional 1MB of L2 per core. That's a lot of fast memory that's local to each core. Depending on how Mathematica handles the computations that you are running, it could give a massive increase in calculation throughput.

Also, while the 7800X3D has only 8 cores, it has them all on one chip, in a single memory domain for L3, and doesn't have to take the hit of talking to cores on a different chip. Additionally, the biggest improvement in the 9000 series processors is their increased AVX-512 throughput. However, those workloads are largely memory bound, and the improvement won't help Mathematica at all.

My fear with the 12 and 16 core Ryzen processors for your use case is that the shared memory model you are using will thrash the memory traffic between the cores through the I/O die, and you'll spend more time waiting on or reading from RAM than actually computing.

A lot of this depends on your budget.
 

Hermitian

Thank you for thinking about my computing environment.

One of my areas of study is the DNA chromosomes of perennial plants. My idea of a medium size chromosome is 20 to 30 MB -- literally a string of that length composed of the letters A,T,G,C. Per chromosome and per "marker", I need to perform 70 to 210 regular expression searches for substrings of size 16 to 512 letters, each of which will produce many coordinates of matches, all of which are then annotated and written to disk per search for post-processing. All of this is regular expression dependent and thus happening asynchronously.
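
To make the kind of search concrete, here is a minimal Python sketch of one regular expression search that returns match coordinates. It is not my Mathematica code: the toy chromosome, the pattern, and the function name are all invented for illustration, and real markers are 16 to 512 letters long.

```python
import re

def find_matches(chromosome: str, pattern: str) -> list[tuple[int, int]]:
    """Return (start, end) coordinates of every non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, chromosome)]

# Toy data only; a real chromosome is a 20 to 30 MB string of A, T, G, C.
toy_chromosome = "ATGCGATATAGCATGCGTTTATATAGC"
toy_pattern = r"(?:TA){2,}"  # hypothetical marker: two or more TA repeats

coords = find_matches(toy_chromosome, toy_pattern)
print(coords)
```

In a real run, each list of coordinates would then be annotated and written to disk before the next search starts.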

For this application on a 30 MB chromosome and a single "marker", I clock the shortest total run time by parallelizing the searches across 9 "kernels" on my 10 core CPU. For example, a total of 72 searches would be calculated by 8 iterations per kernel (a kernel being a Mathematica dispatched process). The memory utilization for the first few minutes is about 80% of my 16 GB (according to Task Manager, incl. the OS) and then drops to about 70% for an hour or so, and then drops further as each kernel finishes its task. I estimate the OS and Mathematica memory overhead at about 17%.
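
The 72 searches over 9 kernels split can be sketched as below. This only illustrates the bookkeeping: the round-robin assignment and the names are assumptions, and it says nothing about how Mathematica actually schedules work across its kernels.

```python
def assign_round_robin(searches: list, n_kernels: int) -> list[list]:
    """Deal the searches out to the kernels one at a time, like cards."""
    return [searches[k::n_kernels] for k in range(n_kernels)]

searches = [f"search_{i}" for i in range(72)]  # placeholder search names
per_kernel = assign_round_robin(searches, 9)

# Number of iterations each kernel has to run.
iterations = [len(batch) for batch in per_kernel]
print(iterations)
```

When the search count is not a multiple of the kernel count, some kernels get one extra iteration and the total run time is set by the slowest kernel.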

A production run of this application looks like all the chromosomes of a single specimen (often 13), and the same set of markers per chromosome (at least a dozen), and depending on the length of each marker anywhere from 70 to 210 regular expression searches per marker.
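
As a rough count of what that means per specimen, assuming exactly 13 chromosomes and exactly 12 markers (the latter is a lower bound, so the real totals can be higher):

```python
chromosomes = 13                 # "often 13" per specimen
markers = 12                     # "at least a dozen", so a lower bound
searches_per_marker = (70, 210)  # range depends on the marker length

# Total regular expression searches for one specimen, low and high estimates.
low, high = (chromosomes * markers * s for s in searches_per_marker)
print(low, high)
```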

Given the memory quotes I received today, I'm going to upgrade to 128GB of DDR4-2933 (what the processor is designed for). This might shake up my benchmarks a bit. At the very least I'll be able to increase the number of compute kernels. My license currently allows up to 14.
 

Hermitian

@Markfw
Mathematics is best described as "the language of measurement". There are categories of mathematical calculations that do not benefit from vector (or tensor) processing -- unless great pains are taken to transform the desired computation into linear algebra operations on lists of integers.

For example, there are many algebras and provably every language is an algebra. In this context, it is interesting to compute the semantic distances between words, phrases, sentences, and entire documents in a given language. This is a topology problem involving a metric space on an algebra. The problem size is enormous: a language of size N has N*(N-1)/2 "edges" in this space. However, it is almost within the grasp of today's supercomputers.
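
A small sketch of how fast the edge count grows. The five-word vocabulary and the 100,000-word language size are invented for illustration; no actual semantic distance is computed here, only the number of pairs that would need one.

```python
from itertools import combinations

def edge_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

words = ["root", "stem", "leaf", "bud", "seed"]  # toy vocabulary
pairs = list(combinations(words, 2))             # every unordered pair of words

print(len(pairs), edge_count(len(words)))
print(edge_count(100_000))  # hypothetical language of 100,000 words
```

The count grows quadratically, which is why even a modest vocabulary produces billions of pairwise distances.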
 
Reactions: Tlh97 and Gideon

jur

Member
Nov 23, 2016
31
11
81
Mathematica supposedly uses Intel MKL, which does support AVX-512. The question is just whether Intel has enabled AVX2 and AVX-512 support on AMD CPUs.

I used Mathematica in my student years and can't say good things about it performance-wise, unless you are really using only highly optimized internal functions or calls that go directly to linked libraries (like MKL). If the calculation takes weeks, then it might be a good idea to let students rewrite it in C++ with the latest MKL/BLAS libs. You can even try oneAPI and compile for other architectures.
 
Reactions: dr1337
Jul 27, 2020
20,898
14,487
146
If it's easy to do a test run for one of your intensive programs, I could do it on my 128 thread Zen 2 Epyc ES and see if your program is able to take advantage of all those threads to run faster. 32 threads of 9950X would easily beat my old gen 128 threads so that could make your decision to get the 9950X easy.
 
Reactions: Tlh97 and Hermitian