Question CPUs for shared memory parallel computing


Tup3x

Golden Member
Dec 31, 2016
1,117
1,146
136
Yes.
My hardware has 10 cores capable of two threads each. The CPU is overclocked, typically achieving 5.12 GHz under heavy loads. The board is liquid cooled.
My Windows O/S is set to prioritize applications.
I instructed Mathematica to run 12 parallel kernels, which results in 12 computational subprocesses plus 2 coordinating subprocesses. Each computational subprocess accounted for about 6% of available CPU load until its task was finished. Memory usage rarely exceeded 20% and is no longer a bottleneck. In conjunction with this, the cores ran about 4°C cooler. However, I believe the number of memory channels (2) is still limiting runtime. Keep in mind I've been historically spoiled by supercomputers with hundreds of cores all on the same memory backplane with 1:1 channel-to-core ratios. Additionally, the number of cores in my system and the number of parallel processes permitted by my Mathematica license inhibit faster times to completion, although the memory channel issue restricts these gains in a less than linear fashion.
Have you actually confirmed that you are memory bandwidth limited? What happens if you halve the memory speed?
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Have you actually confirmed that you are memory bandwidth limited?
I have clocked the start-to-finish time of parallel processes from within each process, and compared them to the start-to-finish times of singleton parallel processes with the same per-process load. The singleton execution times were about 75% of the parallel execution times.
What happens if you halve the memory speed?
I haven't checked.
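For readers following along, the 75% figure can be turned into a scaling estimate. This is a minimal sketch, assuming 12 worker processes (the kernel count mentioned earlier in the thread) and the same per-process load in both runs; the function name and the sample timings are hypothetical.

```python
def scaling_summary(t_single: float, t_parallel: float, n_procs: int) -> dict:
    """Summarize scaling from per-process wall times.

    t_single:   wall time of one process running alone
    t_parallel: wall time of the same per-process load when n_procs run together
    """
    if t_single <= 0 or t_parallel <= 0 or n_procs < 1:
        raise ValueError("times must be positive and n_procs must be >= 1")
    efficiency = t_single / t_parallel  # 1.0 would mean perfect scaling
    return {
        "efficiency": efficiency,
        "slowdown_per_process": t_parallel / t_single,
        # Throughput of n_procs concurrent processes relative to a single one.
        "effective_speedup": n_procs * efficiency,
    }


# Hypothetical timings chosen only to match the reported 75% ratio.
summary = scaling_summary(t_single=75.0, t_parallel=100.0, n_procs=12)
print(summary)
```

With these numbers, 12 processes deliver roughly 9x the throughput of one. The ratio says how much is lost, not where it goes: memory bandwidth, cache contention, or lower clocks under all-core load could each produce a similar figure.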
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
The memory usage information can be tricky. You need to compare the bandwidth limits of your system to the bandwidth saturation or utilization of your processes. Whether you are using only 10-20 GB of your RAM capacity doesn't matter much; what matters is whether your program is only moving something like 10 GB/s of actual total memory bandwidth when your system can achieve far more.

While you are clearly at some limit of your current setup, you still may not have good data on what that limiting factor really is. It could be memory bus throughput, it could be L3 cache capacity, it could be raw compute in one phase and memory throughput in another. If you are hitting a limit, then something will be at 100%.
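One rough way to make that comparison is to compute the theoretical peak from the memory configuration and set it against a measured copy rate. This is a sketch under stated assumptions: the DDR4-3200 transfer rate is an assumption for illustration (the thread only states two channels), each channel is assumed to be 64 bits wide, and the single-threaded `bytearray` copy is a crude stand-in for a proper tool such as the STREAM benchmark. One thread generally cannot saturate the memory bus, so the measured figure is a lower bound.

```python
import time


def theoretical_peak_gbs(channels: int, mega_transfers_per_s: float,
                         bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s: channels x bytes per transfer x transfers per second."""
    return channels * bus_bytes * mega_transfers_per_s * 1e6 / 1e9


def measured_copy_gbs(size_bytes: int = 64 * 1024 * 1024, repeats: int = 5) -> float:
    """Best-of-N buffer copy rate in GB/s, counting bytes read plus bytes written.

    size_bytes should be well above the last-level cache size, otherwise the
    copy is served from cache and the result overstates DRAM bandwidth.
    """
    # Non-zero fill forces the source pages to be physically allocated.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # slice assignment copies the whole buffer
        best = min(best, time.perf_counter() - start)
    return 2 * size_bytes / best / 1e9


peak = theoretical_peak_gbs(channels=2, mega_transfers_per_s=3200)  # assumed DDR4-3200
copy_rate = measured_copy_gbs()
print(f"theoretical peak: {peak:.1f} GB/s, single-thread copy: {copy_rate:.1f} GB/s")
```

To test the workload itself rather than a synthetic copy, a hardware-counter profiler that reports DRAM traffic while the Mathematica kernels run would give the utilization figure to hold against the peak.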
 
Reactions: Sgraffite

Sgraffite

Member
Jul 4, 2001
117
61
101
I have clocked the start-to-finish time of parallel processes from within each process, and compared them to the start-to-finish times of singleton parallel processes with the same per-process load. The singleton execution times were about 75% of the parallel execution times.
That tells you that having more cores/threads isn't scaling linearly, but it doesn't give any hints to why. It could be software or hardware, or both.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
That tells you that having more cores/threads isn't scaling linearly, but it doesn't give any hints to why. It could be software or hardware, or both.
This is why I ran many tests with varying parallelism and problem sizes. The latter is three dimensional. There are inflections in the runtime curves. One of them is from singleton to multiple parallel processes.
 

Sgraffite

Member
Jul 4, 2001
117
61
101
This is why I ran many tests with varying parallelism and problem sizes. The latter is three dimensional. There are inflections in the runtime curves. One of them is from singleton to multiple parallel processes.
Seems like it would be worthwhile to prove your theory like LightningZ71 suggested. Choose a scenario where you think it is bandwidth limited and lower your memory speed.
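If that experiment is run, the two timings can be read through a simple two-term model: runtime = compute time + memory time, with the memory term scaling inversely with memory speed. This is only a sketch under that assumption. Lowering the memory clock also changes latency, which the model ignores, and the function name and sample numbers are hypothetical.

```python
def memory_bound_fraction(t_base: float, t_slow: float, speed_ratio: float) -> float:
    """Estimate the share of baseline runtime that scales with memory speed.

    Model: T(s) = C + M / s, where s is memory speed relative to baseline.
    t_base is T(1.0); t_slow is T(speed_ratio), with 0 < speed_ratio < 1.
    Returns M / t_base.
    """
    if not 0 < speed_ratio < 1:
        raise ValueError("speed_ratio must be between 0 and 1 (exclusive)")
    if t_base <= 0:
        raise ValueError("t_base must be positive")
    # From T(speed_ratio) - T(1.0) = M * (1 / speed_ratio - 1).
    memory_time = (t_slow - t_base) / (1 / speed_ratio - 1)
    return memory_time / t_base


# Hypothetical: halving memory speed stretches a 100 s run to 140 s.
fraction = memory_bound_fraction(t_base=100.0, t_slow=140.0, speed_ratio=0.5)
print(f"{fraction:.0%} of baseline runtime scales with memory speed")
```

A fraction near zero would suggest the run is not bandwidth limited at that problem size; a fraction near one would support the memory-channel theory.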
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Seems like it would be worthwhile to prove your theory like LightningZ71 suggested. Choose a scenario where you think it is bandwidth limited and lower your memory speed.
Have you successfully scaled shared-memory parallel computer programs on hardware platforms with 1:1 to 1:n memory channel ratios?
 
Jul 27, 2020
20,917
14,492
146
Couldn't you cook up a small test set so people here could run it on the trial version of Mathematica and give you some more data on how different CPUs behave? That would show whether it would be better to upgrade rather than expend so much effort on optimizing the code for a CPU that is going to be 5 generations old soon (possibly in two days, when Arrow Lake Core Ultra 285K gets announced).
 

MS_AT

Senior member
Jul 15, 2024
365
798
96
Sorry if this makes you feel bad but this is a frickin' $900 devkit embarrassing your huge desktop PC: https://browser.geekbench.com/v6/cpu/compare/8010211?baseline=8143535

In a toy benchmark that may or may not correlate with the problem at hand. But more importantly, it would be worth checking how Mathematica runs on WoA. According to the support page there is no official WoA version. So it doesn't really matter how much faster X Elite is in Geekbench if it needs to go through an emulation layer in Windows.
 
Reactions: Nothingness

StefanR5R

Elite Member
Dec 10, 2016
6,057
9,106
136
Caveat: RAM increase only possible by throwing the whole thing away and buying another.
(Also, cooling constrained form factor.)
 
Jul 27, 2020
20,917
14,492
146

Up to 23.8x faster basecalling for DNA sequencing in Oxford Nanopore MinKNOW when compared to the 16-inch MacBook Pro with Core i9, and up to 1.8x faster when compared to the 16-inch MacBook Pro with M1 Pro.
 

StefanR5R

Elite Member
Dec 10, 2016
6,057
9,106
136
Apple said:
Up to 23.8x faster
Nice trolling attempt. ;-) Definitely doesn't sound like CPU processing, but as if one of the accelerators is used. The GPU maybe?
 
Jul 27, 2020
20,917
14,492
146
Against which baseline on i9? Singlethreaded x87?
Have you tried running anything on even the base M1? It is lightning fast in certain workloads.


A 9950X usually completes that Excel benchmark in 9 seconds (and yes, it seems to be a single threaded benchmark).
 

StefanR5R

Elite Member
Dec 10, 2016
6,057
9,106
136
Have you tried running anything on even the base M1?
None of the computing intensive software which I use at work, and almost none of which I use at home runs natively on it. For different reasons, there just happens to be no intent to port the software. — But the software which the OP uses has got a macOS-on-Apple-Silicon port. We lack performance profiling insight though to make good guesses how it would perform there, and if it would be a viable platform for the OP's longer term goal of scaling the dataset size up.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
None of the computing intensive software which I use at work, and almost none of which I use at home runs natively on it. For different reasons, there just happens to be no intent to port the software.
Out of curiosity, is any of the SW you use available on Intel macOS (I mean recent versions)?
 

StefanR5R

Elite Member
Dec 10, 2016
6,057
9,106
136
Out of curiosity, is any of the SW you use available on Intel macOS (I mean recent versions)?
At my work: Semi-OT, because that's shared memory parallel computing too but mostly floating point, solving of systems of linear equations, and miscellaneous postprocessing.
No Intel macOS ports exist either. To ISVs in my line of work, Apple devices are as significant as Nintendo's and Sony's.

At home in my hobbies: Almost OT, because that's often floating point too, and not a lot of memory sharing between threads. When heavy sharing goes on, it's happening within shared processor caches, not as much in main memory.
There exist a few more software builds for OS X on Intel than there are for macOS on Apple Silicon, but I have no idea if these builds are recent enough to run on current macOS versions. Apple breaks compatibility quicker than you can say "obsolescence". This is one of the several reasons why hardly anybody in my computer related hobbies cares for macOS.
 
Reactions: Nothingness