Question CPUs for shared memory parallel computing


Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
If there is a "Trial" and example software to test, you could post in this sub-forum. Users here have many different CPU and RAM configurations and would likely be willing to run it on various hardware.

Other options, besides regular commercial cloud CPUs, where you might be able to get run-time stats from specific CPU / RAM combinations include places like:



I wonder if a large L3 cache would not speed up this kind of task some, or a bunch? Either an AMD X3D (3D V-Cache) part or an Intel CPU with a large unified cache? It might require few threads per task?
Thank you for your thoughts. I'm grateful to have input from others because it is tedious working alone.

There is a well-defined difference between parallel and distributed computing -- at least among high-performance computing professionals. The application being discussed is designed for parallel, not distributed computing. Licensing issues are a major road block to the latter.

I've been looking at the cache issue and agree. Each time the program performs a regex search on a chromosome String of size ~20 MB, I watch latencies of 0.01 sec occur at regular intervals.

I've been spoiled in the past by supercomputers with 16+ processors on the same memory backplane and consequently the same number of memory channels. At present I have 1 processor with 10 cores and 2 memory channels. Optimizing for shortest run time is a new challenge for me.
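
One way to see where pauses like these come from is to time the search window by window instead of as a single call. The following is only a minimal Python sketch: the thread does not say which language or patterns the program uses, so the synthetic sequence, the motif, and the function names `make_sequence` and `timed_windows` are all assumptions made for illustration.

```python
import random
import re
import time

def make_sequence(n_bases, seed=0):
    """Build a synthetic A/C/G/T string (a stand-in for a real chromosome)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n_bases))

def timed_windows(seq, pattern, window=1_000_000, overlap=64):
    """Search seq window by window; return (match_count, per-window seconds).

    overlap must be at least the longest possible match length minus one,
    so that a match straddling a window boundary is still seen. A match is
    counted only in the window where it starts, so none is counted twice.
    The total equals a single full scan for patterns whose matches cannot
    overlap one another (true for the fixed motif used below).
    """
    rx = re.compile(pattern)
    count, timings = 0, []
    for start in range(0, len(seq), window):
        end = min(start + window + overlap, len(seq))
        t0 = time.perf_counter()
        # pos/endpos search in place, without copying a slice of seq
        for m in rx.finditer(seq, start, end):
            if m.start() < start + window:
                count += 1
        timings.append(time.perf_counter() - t0)
    return count, timings

if __name__ == "__main__":
    seq = make_sequence(2_000_000)            # use ~20_000_000 for a 20 MB case
    n, secs = timed_windows(seq, "GATTACA", window=250_000)
    print(n, "matches; slowest window:", max(secs), "s")
```

If the slow windows recur at a fixed stride regardless of the pattern, that would hint at the memory system (or the runtime's memory management) rather than the regex itself, but that interpretation would need to be confirmed on the real code.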
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,054
15,195
136
There is a well-defined difference between parallel and distributed computing -- at least among high-performance computing professionals. The application being discussed is designed for parallel, not distributed computing.
I beg to differ. One example (there are many): PrimeGrid can use up to 256 processors at the same time for one job. But for best efficiency, given the design of Zen 4 CPUs for example, we choose to use only 8 or 16 per task. We use the term "distributed computing" possibly loosely, as the forum is named that. There are MANY applications that work the same way in BOINC, the main application that we use to access and manage tasks in other applications.

And I do believe that we are high performance computing professionals. Many of us have literally hundreds to thousands of cores in our houses (I have 1800). If that is not high performance I don't know what is.
 

cellarnoise

Senior member
Mar 22, 2017
744
403
136
Thank you for your thoughts. I'm grateful to have input from others because it is tedious working alone.

There is a well-defined difference between parallel and distributed computing -- at least among high-performance computing professionals. The application being discussed is designed for parallel, not distributed computing. Licensing issues are a major road block to the latter.

I've been looking at the cache issue and agree. Each time the program performs a regex search on a chromosome String of size ~20 MB, I watch latencies of 0.01 sec occur at regular intervals.

I've been spoiled in the past by supercomputers with 16+ processors on the same memory backplane and consequently the same number of memory channels. At present I have 1 processor with 10 cores and 2 memory channels. Optimizing for shortest run time is a new challenge for me.
@StefanR5R might be able to help with what may be a limited example? I think a large-L3-cache CPU that you could "pin" tasks on might help a great deal, or, as you are chasing, faster RAM and a larger amount of RAM might help. If your software has datasets that spill even into SSD disk space, the options for going faster are few. Good luck!
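
As a rough illustration of what "pinning" could look like, here is a minimal Python sketch. The helper names `ccd_core_groups` and `pin_current_process` are made up for this example, and the grouping assumes that consecutive logical core ids share a chiplet and its L3, which is not guaranteed; the real layout should be checked with a tool such as lscpu or hwloc. `os.sched_setaffinity` is a real call but is available on Linux only.

```python
import os

def ccd_core_groups(n_cores, cores_per_ccd=8):
    """Split core ids 0..n_cores-1 into consecutive groups of cores_per_ccd.

    Assumption: consecutive ids sit on the same chiplet. Actual numbering
    depends on the OS and on SMT, so verify it before relying on this.
    """
    return [list(range(i, min(i + cores_per_ccd, n_cores)))
            for i in range(0, n_cores, cores_per_ccd)]

def pin_current_process(cores):
    """Restrict the calling process to the given cores where supported.

    Returns True if the affinity was changed, False otherwise
    (non-Linux platform, or none of the requested cores is available).
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    allowed = set(cores) & os.sched_getaffinity(0)
    if not allowed:
        return False
    os.sched_setaffinity(0, allowed)
    return True

if __name__ == "__main__":
    groups = ccd_core_groups(os.cpu_count() or 1)
    print("core groups:", groups)
    print("pinned to first group:", pin_current_process(groups[0]))
```

The idea would be to start one worker process per group and pin each to its own group, so that a task's data stays in one L3 instead of migrating between chiplets.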

Maybe the now-defunct line? I don't think new products like this are coming out, but I have not tracked it... 3D XPoint-based products... would be good to be able to use?
 
Reactions: Hermetian

cellarnoise

Senior member
Mar 22, 2017
744
403
136
No need to beg 🙂.
There is also a well defined difference between high-performance hardware and high-performance computing. There's a journal by the latter name of which Jack Dongarra is the editor in chief.
I'm of the kind that likes more cores.. but there are so many different use cases anymore. I'll ping @StefanR5R one more time, but I really fear that I may be ended soon. @StefanR5R knows science and multi-threaded sheet better than anyone on here that has spoken up anyway. My day is likely done to some degree! ... Dead already
 

cellarnoise

Senior member
Mar 22, 2017
744
403
136
AMD's current generations of CPUs (Zen 3, 4, 5 with the true P cores)
"only" have 32 MB of L3 cache per chiplet, on the full Zen 4 and Zen 5 chiplets. Only 8 cores per 32 MB chiplet and that is that. That is a limit that has to be taken into account on any workload.
 
Reactions: Hermetian
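
To put the 32 MB per 8-core chiplet figure next to the ~20 MB chromosome string mentioned earlier in the thread, here is a small back-of-the-envelope sketch in Python. The function names are made up for illustration, and the 40 MB case is an assumption (the same string stored at 2 bytes per character); the thread does not say how the string is encoded.

```python
def l3_share_per_core(l3_mb, cores):
    """L3 capacity per core if every core on the chiplet is busy."""
    return l3_mb / cores

def resident_working_sets(l3_mb, working_set_mb):
    """Whole working sets that fit in one L3 at once (ignores code and other data)."""
    return int(l3_mb // working_set_mb)

print(l3_share_per_core(32, 8))        # 4.0 MB of L3 per core
print(resident_working_sets(32, 20))   # 1: one 20 MB string per chiplet
print(resident_working_sets(32, 40))   # 0: assumed 2-byte characters, does not fit
```

Under these assumptions only one such string can stay resident per chiplet, which is consistent with the earlier suggestion that few threads per task may work best.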

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
... If your software has datasets that are going into even SSD disk space, the options for faster are few Good Luck!
Thank you!
The machine I'd like to use is currently $5.4M (U.S.). What I could afford is up to $5k. So instead I work on optimizing the code and maxing out the memory in my current i9 10-core system. I'm accustomed to runs of days and weeks with highly optimized codes. What I'd like to do is work on some problems that would currently take a year.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,054
15,195
136
Thank you!
The machine I'd like to use is currently $5.4M (U.S.). What I could afford is up to $5k. So instead I work on optimizing the code and maxing out the memory in my current i9 10-core system. I'm accustomed to runs of days and weeks with highly optimized codes. What I'd like to do is work on some problems that would currently take a year.
And 128 cores and 12 channels of memory (192 GB) in one socket would not help you ???? That's what I have 9 of.

Edit: and one of those systems can be had for about $5,000, your budget.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
And 128 cores and 12 channels of memory(192 gig) in one socket would not help you ??
If I was a frequent Oracle user, I'd buy one of those boxes for dedicated service. It would run faster there (or even on a multicore laptop) than my $5.4M wannabe machine.

But I'm not so sure about my chromosome analysis code on your box. Please provide a link to it on the manufacturer's site or a robust dealer site.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,054
15,195
136
If I was a frequent Oracle user, I'd buy one of those boxes for dedicated service. It would run faster there (or even on a multicore laptop) than my $5.4M wannabe machine.

But I'm not so sure about my chromosome analysis code on your box. Please provide a link to it on the manufacturer's site or a robust dealer site.
It's DIY, you have to build it. Here is a link to the CPU


Motherboard


ram (12 of these, but if you search you can get it as low as $72 a stick, what I paid)


Any NVME you want, here is a 2 TB one


and Windows is about $140, yes it works fine on this system, I have 4 of them on win 10 pro.

That's $5,368; you just need a case, a heatsink, and a PSU. If you are interested, I am sure we can all recommend them. About $500 more. (Windows 10 is included in that number.)

edit: I used $100 each on the ram sticks, I can find it for less later if you are interested.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
It's DIY, you have to build it. Here is a link to the CPU
Here's the link to the AMD 9554:

It has 64 cores, max boost of 3.75 GHz, 256 MB L3, and 12 memory channels. That's about 5 cores per channel (64/12 ≈ 5.3) -- roughly the same as I currently have (10/2 = 5).

I wouldn't have to build a system; there are at least 3 brands out there that provide complete systems in a box with an option for a liquid cooler. In addition, I could purchase an extended repair warranty. The personal bonus for me is that I'd rather work in my garden of 36 different fruit trees than build and maintain computer hardware.

Thanks for the info, but I'm holding out for something with 5+GHz, larger cache, and a better core/channel ratio. My wannabe machine has a ratio of 1.
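
The core/channel comparison above can be written out as a small calculation. This is only a sketch in Python: the core and channel counts come from the thread, but the memory speeds (DDR4-2933 for the i9-10900KF and DDR5-4800 for the EPYC 9554) are assumed stock values that the thread does not state, and the bandwidth figures are theoretical peaks, not measurements.

```python
def cores_per_channel(cores, channels):
    """Cores sharing each memory channel (lower means less contention)."""
    return cores / channels

def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak in GB/s: channels x MT/s x 8 bytes per 64-bit transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000

# name: (cores, channels, assumed MT/s)
systems = {
    "i9-10900KF": (10, 2, 2933),   # assumed DDR4-2933
    "EPYC 9554": (64, 12, 4800),   # assumed DDR5-4800
}

for name, (cores, channels, mts) in systems.items():
    ratio = cores_per_channel(cores, channels)
    bw = peak_bandwidth_gbs(channels, mts)
    print(f"{name}: {ratio:.2f} cores/channel, "
          f"{bw:.1f} GB/s peak, {bw / cores:.2f} GB/s per core")
```

Under these assumptions the ratio is similar on both systems (5.00 versus 5.33), while the peak bandwidth per core comes out somewhat higher on the EPYC because each channel is faster. A machine with a ratio of 1 would still be in a different class on this metric.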
 
Jul 27, 2020
19,613
13,476
146
Please provide a link to it on the manufacturer's site or a robust dealer site.
I understand that you may not be willing to get into the adventure of building a system yourself. There's this company that may help you build something to your liking and you could get even a Threadripper system from them: https://www.pugetsystems.com/solutions/

The base configuration of TR is $4500 here: https://superworkstations.com/products/lenovo-thinkstation-p8/

You may be able to get much better price if you talk directly to Lenovo support.
 
Jul 27, 2020
19,613
13,476
146
So that's your dream CPU? It has only 8 channels, though.

I really think you should look at the 9184X. That big a cache can come in very handy.


Plus, this is super cheap. You just gotta add RAM, a case, PSU and storage: https://www.newegg.com/tyan-s8030gm...ndled-with-1-x-amd-epyc-731/p/N82E16813151340

Dual score here: https://www.cpubenchmark.net/cpu.php?cpu=AMD+EPYC+7373X&id=6099&cpuCount=2

Just halve that and you get an idea of what you are buying. It has 768MB cache too. Seems like the most cost effective deal for you, with 12 channels.
 
Jul 27, 2020
19,613
13,476
146
Here's the AMD 9184X specs. Note the max all-core boost
It might seem low but it is higher performance per clock than your 10900K, plus if your workload fits inside the massive cache and the cache is able to fulfill your workload's demands before the next set of data has to be pulled from system RAM, it will lead to fewer idle CPU cycles and you will get more performance out of it than some 5 GHz CPU with lesser cache. The majority of a CPU's life is spent just twiddling its thumbs waiting for data to arrive.
 

Hermetian

Member
Sep 1, 2024
71
54
46
frostconcepts.org
It might seem low but it is higher performance per clock than your 10900K,
My i9-10900KF routinely runs all 10 cores synchronized at 5.2 GHz. Its cache though is low at 20 MB. My SSD though is optimized.
The majority of a CPU's life is spent just twiddling its thumbs waiting for data to arrive.
A functional programming model and compiler can help with that.
 
Jul 27, 2020
19,613
13,476
146
My i9-10900KF routinely runs all 10 cores synchronized at 5.2 GHz. Its cache though is low at 20 MB. My SSD though is optimized.


Your 10900K "may" perform a bit higher than the one in the above link coz that's an average score. But there is no way that your 5.2 GHz CPU can beat even a poorly configured 9184X. This is due to better architecture (where the CPU design engineers eliminate bottlenecks to maximize throughput).

I hope you don't think that anyone here is trying to prove you wrong or anything. I have a sincere intention to help you realize that you are on a pretty old CPU generation and things have improved radically since then. Even Intel's Alder Lake (12900K) is a massive improvement over your CPU. I'm unfortunately getting the impression that you are basing your assumptions on experience with your own CPU which is way outdated in 2024. Something modern running at 4.5 GHz can easily run circles around your 5.2 GHz cores.
 
Reactions: Hermetian

naukkis

Senior member
Jun 5, 2002
871
737
136
My i9-10900KF routinely runs all 10 cores synchronized at 5.2 GHz. Its cache though is low at 20 MB. My SSD though is optimized.

A functional programming model and compiler can help with that.
Actually, the 10900 runs its ring asynchronously. Have you overclocked the ring to 5.2 GHz too? Usually that isn't possible (stock ring is ~4.3 GHz or so). AMD 8-core CPUs instead run the ring at the fastest core clock, resulting in lower core-to-core latencies. But the massive-cache EPYCs recommended here are chiplet-based with terrible core-to-core latencies; if that matters, those aren't feasible at all.
 
Jul 27, 2020
19,613
13,476
146
... if the load being tested is a standard application. But I'm not running a database, engaged in AI, or trying to profit from bit-coin mining. I am spawning parallel computations on cores of the same processor.
True. But AMD has put a lot of effort into making sure that their CPUs accelerate all kinds of workloads. I can even let you access my 128 thread eight channel Epyc over Anydesk or any other remote software of your choice so you can spend up to a whole day if you want, testing and benchmarking to your heart's content. I would be very interested in your findings. Just let me know beforehand so we can schedule it while I'm at home so I can assist if there are any connectivity issues or anything else that I may need to do at my end.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,054
15,195
136
Frontier is the world's fastest supercomputer. Note below:

Frontier uses 9,472 AMD Epyc 7713 "Trento" 64 core 2 GHz CPUs (606,208 cores) and 37,888 Instinct MI250X GPUs (8,335,360 cores). They can perform double-precision operations at the same speed as single precision.

GHz is not the best way to get processing power. While it is water-cooled, I doubt that they could have even overclocked it as much as a stock 9554.

specs of 7713

Name: AMD EPYC™ 7713
Family: EPYC
Series: EPYC 7003 Series
Form Factor: Servers
# of CPU Cores: 64
# of Threads: 128
Max. Boost Clock: Up to 3.67 GHz
Base Clock: 2 GHz
L3 Cache: 256 MB
1kU Pricing: 7060 USD
Default TDP: 225W
AMD Configurable TDP (cTDP): 225-240W
CPU Socket: SP3
Socket Count: 1P / 2P
Launch Date: 03/15/2021

specs of the 9554 THAT I SUGGESTED

Name: AMD EPYC™ 9554
Family: EPYC
Series: EPYC 9004 Series
Form Factor: Servers
Regional Availability: Global, China, NA, EMEA, APJ, LATAM
# of CPU Cores: 64
# of Threads: 128
Max. Boost Clock: Up to 3.75 GHz
All Core Boost Speed: 3.75 GHz
Base Clock: 3.1 GHz
L3 Cache: 256 MB
1kU Pricing: 9087 USD
Default TDP: 360W
AMD Configurable TDP (cTDP): 320-400W
CPU Socket: SP5
Socket Count: 1P / 2P
Launch Date: 11/10/2022
 