Question CPUs for shared memory parallel computing

Hermitian · Sep 1, 2024

I'm a computational scientist who writes shared memory parallel programs with Mathematica in the MS Windows environment. I'd like to keep up with hardware trends that could improve my computing platform.

At present I have a liquid-cooled i9-10900KF processor with 16GB RAM and 1TB SSD. The cores have been synchronized and overclocked (with Intel's temperature control) to about 5 GHz. The programs I write run anywhere from a few minutes to a few weeks. The longer runs are typically at 70% to 80% total CPU and Memory load, with the core temperatures in the low 80°C range.

Markfw · Sep 1, 2024

Hermetian said:
I'm a computational scientist who writes shared memory parallel programs with Mathematica in the MS Windows environment. I'd like to keep up with hardware trends that could improve my computing platform.

At present I have a liquid-cooled i9-10900KF processor with 16GB RAM and 1TB SSD. The cores have been synchronized and overclocked (with Intel's temperature control) to about 5 GHz. The programs I write run anywhere from a few minutes to a few weeks. The longer runs are typically at 70% to 80% total CPU and Memory load, with the core temperatures in the low 80°C range.

Very simple. AMD 9950x, full avx-512 and with your liquid cooling solution 5.7-6.0 is possible IMO. @Det0x Sound right to you ?

Use ddr5 6000 cl 30 or lower ram. You can reuse your ssd, but I am sure its not PCIE 4.0. If you process relies on a lot of IO, change the ssd.

Nothingness · Sep 1, 2024

Do you know if Mathematica uses AVX-512? If so then, I think an AMD CPU might be a good choice.

Markfw · Sep 1, 2024

Nothingness said:
Do you know if Mathematica uses AVX-512? If so then, I think an AMD CPU might be a good choice.

Even if its doesn't it will still be light years faster than a 10900KF and it won't destruct in a month like A 14900k/kf/ks would. Not sure on the GHZ speed for sustained load with WC, but detx0 can answer that.

Hermitian · Sep 1, 2024

Markfw said:
Very simple. AMD 9950x, full avx-512 and with your liquid cooling solution 5.7-6.0 is possible IMO. @Det0x Sound right to you ?

Use ddr5 6000 cl 30 or lower ram. You can reuse your ssd, but I am sure its not PCIE 4.0. If you process relies on a lot of IO, change the ssd.

It's an interesting CPU - with 16 cores, 5.7 GHz max clock, 192 GB max RAM, and a disappointing 2 memory channels (same as current).

It has the potential to cut the total run time of most of my parallel programs by 9/14ths. I also have a class of problems with larger memory requirements per core. At present they will only "fit" in 6 cores due to my current 16GB installed RAM. I'm at that well-known junction of either upgrading RAM or upgrading the entire system.

Tech Junky · Sep 1, 2024

Hermetian said:
upgrading the entire system.

RAM only gets you so far but it's cheaper than a rebuild. Moving to AMD will unleash the potential and free up time. Besides that you get to skip the Intel hybrid cores which would likely be an issue. DDR 5 should speed things up as well. You can get 32GB under $100 easily.

Hermitian · Sep 1, 2024

Markfw said:
Even if its doesn't it will still be light years faster than a 10900KF.

In terms of CPU clock: Light years, no. Twice as fast, no. 5.7/5.0, yes.

In terms of total run time of my parallel computations under 30MB input data size per core: 14/9ths faster.

Hermitian · Sep 1, 2024

Nothingness said:
Do you know if Mathematica uses AVX-512? If so then, I think an AMD CPU might be a good choice.

It does not.

Schmide · Sep 1, 2024

Are you using gridMathematica?
Do you have a good GPU?

Hermitian · Sep 1, 2024

Tech Junky said:
RAM only gets you so far but it's cheaper than a rebuild. Moving to AMD will unleash the potential and free up time. Besides that you get to skip the Intel hybrid cores which would likely be an issue. DDR 5 should speed things up as well. You can get 32GB under $100 easily.

Thanks. I'm not a fan of the hybrid cores either.

Hermitian · Sep 1, 2024

Schmide said:
Are you using gridMathematica?
Do you have a good GPU?

No, I'm using shared memory parallelism functions such as InitKernels, ParallelDo, CloseKernels.

See first post for CPU.

Markfw · Sep 1, 2024

Hermetian said:
In terms of CPU clock: Light years, no. Twice as fast, no. 5.7/5.0, yes.

In terms of total run time of my parallel computations under 30MB input data size per core: 14/9ths faster.

Most math intensive programs use some type of special code sse4, avx, avx-512, etc. I was assuming it used one of those, hopefully avx-512 (the fastest of the special codesets)

For math intensive programs in the DC world, we use AMD a lot of the time. If you want more than 16 cores, more than 2 memory channels and a lot of ram, consider a Genoa system. The only one that is relatively affordable is the Genoa. 64 cores, 128 threads, 192 gig ram (what I chose to not break the bank) 12 memory channels @ DDR5 4800 and 3.5 ghz all core minimum. THAT will be light years faster. But about $4,500 for the whole thing for a retail CPU. If you can find en ES chip, take $1400 off of that figure.

maddie · Sep 1, 2024

Without going full server, the upcoming Strix Halo with 16 Zen 5 cores and 256 bit ram might be an interesting choice, The problem is, it's months from release.

Hermitian · Sep 1, 2024

Markfw said:
For math intensive programs in the DC world

I'm not in the Distributed Computing world.

Markfw said:
THAT will be light years faster.

As a physicist, I guarantee it is not.

Hermitian · Sep 1, 2024

maddie said:
Without going full server, the upcoming Strix Halo with 16 Zen 5 cores and 256 bit ram might be an interesting choice, The problem is, it's months from release.

Thanks, I'll check it out.

Schmide · Sep 1, 2024

My point being. If your algorithm has a ton of dependencies and is constantly syncing memory, a faster processor is still going to be memory bottlenecked. The 70% CPU utilization kind of hints to that. 16gb memory isn't a lot these days. 32-64gb is relatively cheap, the balance being 64gb often has looser timings, 32gb faster. So some algorithms may like faster memory, others more.

If the algorithm has a moderate amount of parallelism, GPU compute will win over a CPU more often than not. The wolfman engine supports CUDA and openCL, though they will probably make you pay though.

SSDs though not as cheap as they were a few months ago, can go a long way to to alleviating large dataset thrashing. They are starting to make SLC SSDs that have good longevity for compute tasks. Those are still server only. Still a high speed dedicated SSD with large command queues could go a long way. Remember the more bits per cell will have less longevity for these tasks. SLC = single bit per cell. MLC multi level cell 2 bits. TLC 3 QLC 4. etc

As always evaluate your algorithm and clean up your darn headers before impulse buying more compute.

Markfw · Sep 1, 2024

Well, if you want more than 2 memory channels, your choices are very limited. If you do not want to consider server CPUs setups, than your choices are even more limited. You are not going to find any current CPUs that have more sustained speed than a 9950x that don't degrade and certainly none other than that, that can maintain that speed for 24/7 for weeks. That cpu has a lot of internal improvements. You can read the Zen 5 technical thread for more information. As for memory. DDR5 6000 at 1T is "sort of" quad channel. as DDR5 is dual channel I read, and it 2 channels of that. DDR5 6000 is already more than twice the speed of DDR4 2933. and 48 gig of DDR4 6000 CL30 is $219\

https://www.newegg.com/corsair-48gb-ddr5-6000/p/N82E16820982095

I have provided the best options that I can think of, so good luck, I am done here.

Hermitian · Sep 1, 2024

Schmide said:
My point being. ...

What you've said is true. Keep in mind I have hundreds of algorithms, using only a few in any particular program.

The 70% to 80% load is due to my choice in divide and conquer parallelism and the number of cores I've assigned for parallelism. I typically use 9 and leave one for the operating system. As mentioned above, there are some computations I'd like to do but no matter how I slice it they will only "fit" on 6 cores. Your advice to upgrade memory is very good for the medium term, I got a quote for 48GB of DDR4-2933 installed for $138. However my "wanna be" computation would only drop from 144 to 96 days. Oh well. 35 years ago my thesis computation ran for 3 days on a Cray XMP. It now runs in twenty seconds on my cell phone.

I've no interest in GPU computing, regarding it as a distributed processing paradigm. It's ironic because 30 years ago I was a member of the MPI forum. But so was Intel, and they mined all the use cases -- putting them into shared memory hardware. Nowadays it's commonplace among manufacturers.

As a researcher whose office is at home, parallel computing on a single device is the most efficient model. I look forward to hearing about new CPUs here.

NTMBK · Sep 1, 2024

Hermetian said:
As a physicist, I guarantee it is not.

But it could conceivably be light years longer. Dimensionally speaking.

LightningZ71 · Sep 1, 2024

The input data size per core is under 30MB? What is the active memory footprint for each process at any given moment? Not the entire dataset size, but how much memory is it juggling at any given moment?

The reason that I ask this is that the 7800X3D has 96MB of L3 cache and an additional 1MB per core of L2. That's a lot of fast memory that's local to each core. Depending on how mathematica handles the computations that you are running, it could make a massive increase in calculation throughput.

Also, while the 7800X3D is only 8 cores, it has them all on one chip, in a single memory domain for L3 and doesn't have to take the hit for having to talk to cores on a different chip. Additionally, the biggest improvement on the 9000 series processors is their AVX-512 throughput increases, however, those are largely memory bound and won't help mathematica at all.

My fear with the 12 and 16 core Ryzen processors for your use case is that the shared memory space model you are using will thrash the memory traffic between the cores on the I/O die and you'll spend more time waiting or reading from RAM than actually computing.

A lot of this depends on your budget.

Hermitian · Sep 2, 2024

LightningZ71 said:
...

Thank you for thinking about my computing environment.

One of my areas of study are the DNA chromosomes of perennial plants. My idea of a medium size chromosome is 20 to 30 MB -- literally a string of that length composed of the letters A,T,G,C. Per chromosome and per "marker", I need to perform 70 to 210 regular expression searches for substrings of size 16 to 512 letters, each of which will produce many coordinates of matches, all of which are then annotated and written to disk per search for post-processing. All of this is regular expression dependent and thus happening asynchronously.

For this application on a 30 MB chromosome and a single "marker", I clock the shortest total run time by paralyzing on the searches across 9 "kernels" on my 10 core CPU. For example, a total of 72 searches would be calculated by 8 iterations per kernel (a Mathematica dispatched process). The memory utilization for the first few minutes is about 80% of my 16 GB (according to Task Manager, incl. the OS) and then drops to about 70% for an hour or so, and then drops further as each kernel finishes its task. I estimate the OS and Mathematica memory overhead at about 17%.

A production run of this application looks like all the chromosomes of a single specimen (often 13), and the same set of markers per chromosome (at least a dozen), and depending on the length of each marker anywhere from 70 to 210 regular expression searches per marker.

Given the memory quotes I received today, I'm going to upgrade to 128GB of DDR4-2933 (what the processor is designed for). This might shake up my benchmarks a bit. At the very least I'll be able to increase the number of compute kernels. My license currently allows up to 14.

Hermitian · Sep 2, 2024

@Markfw
Mathematics is best described as "the language of measurement". There are categories of mathematical calculations that do not benefit from vector (or tensor) processing -- unless great pains are taken to transform the desired computation into linear algebra operations on lists of integers.

For example, there are many algebras and provably every language is an algebra. In this context, it is interesting to compute the semantic distances between words, phrases, sentences, and entire documents in a given language. This is a topology problem involving a metric space on an algebra. The problem size is enormous: a language of size N has N*(N-1)/2 "edges" in this space. However, it is almost within the grasp of today's supercomputers.

jur · Sep 2, 2024

Mathematica is supposedly using Intel MKL, which does support avx512. The question is just if Intel enabled AVX2, AVX512 support on AMD cpus.

I've been using Mathematica in my student years and I can't say good things performance wise, except if you are really using only highly optimized internal functions or calls that go directly to linked libraries (like MKL). If the calculation takes weeks, then it might be a good idea to let students rewrite it in c++ with latest mkl / blas libs. You can even try oneapi and compile for other architectures.

Hermitian · Sep 2, 2024

jur said:
Mathematica is supposedly using Intel MKL, which does support avx512.

I believe Mathematica has disabled avx in its library. There are good reasons not to use it.

I'm running Wolfram (Mathematica) 14.1. I'm not a student nor do I have students. As the saying goes, "old professors never die, they just lose their faculties".

igor_kavinski · Sep 2, 2024

If it's easy to do a test run for one of your intensive programs, I could do it on my 128 thread Zen 2 Epyc ES and see if your program is able to take advantage of all those threads to run faster. 32 threads of 9950X would easily beat my old gen 128 threads so that could make your decision to get the 9950X easy.

Question CPUs for shared memory parallel computing

Member

Moderator Emeritus, Elite Member

Diamond Member

Moderator Emeritus, Elite Member

Member

Diamond Member

Member

Member

Diamond Member

Member

Member

Moderator Emeritus, Elite Member

Diamond Member

Member

Member

Diamond Member

Moderator Emeritus, Elite Member

Member

Lifer

Golden Member

Member

Member

Member

Member

Lifer