Question CPUs for shared memory parallel computing


StefanR5R

Elite Member
Dec 10, 2016
6,057
9,106
136
While "% CPU load" shows whether threads are busy or sleeping (e.g. waiting for disk I/O, or waiting on semaphores for inter-thread synchronization), it unfortunately does not show how the CPU time is spent: computing, chasing pointers, trying to read from or write to main memory (perhaps while other threads attempt the same at other addresses), or busy-waiting on a spinlock, though that last seems unlikely to be relevant to your application.
 
Reactions: Hermitian

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
For a product similar to Mathematica, look at Maple license cost. Are they overcharging too?

For people who need that kind of products, they're worth every penny spent.
@Hermitian I'm not very familiar with Mathematica, but looking at your code, I don't see anything very Mathematica-specific. Nothing about math function manipulation or integral approximations or anything like that. It just looks like string manipulation with regular expressions.

I really think you should look into doing this in Python instead. Python is a free (as in beer and speech) language which is very popular with scientists, among others. I'd say it's at least as popular with scientists as Fortran was back in the day, but much easier to learn and use. It tends to be one of the faster high-level languages, and you don't have to pay for unlimited threads. (I think that's the same as "kernels".) It also has libraries available for many things. I found one called BioPython that seems up your alley, though I don't know if it applies to this particular project.

You seem familiar with Mathematica syntax. If you're generally familiar with procedural programming languages, you might find it instructive to skim this page for an idea of Python's syntax. I didn't look deeper to see if the other resources are useful.

The main different things about Python (which I find annoying) are:
- Grouping instructions, e.g. for an if or a loop, is done by a colon followed by indenting the following lines. I see Mathematica does this with [square brackets]. (Most languages do this with {curly braces}; some say "end" at the end of a block instead.) Python's method also means editors can't fully auto-indent for you: since the indentation is part of the syntax, you have to keep track of it yourself.
- For loops only apply to lists, like "foreach" in other languages. You can generate a numeric list with range() to work like most languages' for loops. But I don't see a for loop in your code so you may not care.
- There is usually one canonical, "Pythonic", socially acceptable way to do any given thing you want to do. I find this restrictive and overbearing, but I guess it makes reading other people's code easier.
 
Reactions: Sgraffite

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
My experience with Python is just plain horrible from a performance point of view. I've seen so many prototypes written in Python end up in production that are so slow that they need a full rewrite (in Python or other languages).

I guess many people who find Python performance good enough don't realize that's only the case because most of the time is spent in external native libraries.

I don't think going from Mathematica to Python is the way to go to really gain speed. It might be an easier path than going lower level (C/C++/Rust/Go), but that'd be completely wasted time if the result is slower or only moderately faster.

Python has its uses (I developed scripts to generate code for our simulator based on textual specifications). But for performance, I will never recommend it.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
looking at your code, I don't see anything very Mathematica-specific. Nothing about math function manipulation.
Mathematics is the language of measurement. College calculus is a small part of its realm.

The algorithm I've implemented here is Variable Width Discrete Correlation. It is an application of signal processing to noisy data. In this case the signals are cell chromosomes.
I really think you should look into doing this in Python instead.
Thank you for the suggestion. I'm familiar with Python and many other languages, beginning with octal and Algol. Python is inherently slow and poorly supported in terms of efficient, error-free libraries. It does support a functional programming model but it doesn't compile any better than sequential style. The best compiler I've had experience with is Cray Fortran. Wolfram is a close second, and comes in first in terms of functionality.
 
Reactions: Nothingness

Cogman

Lifer
Sep 19, 2000
10,283
134
106
@Hermitian
Are you using any of the mathematics capabilities of Mathematica at this point, or mainly regex processing?

I ask because if the licensing for kernel cores is starting to be a limiting factor, you might be better off moving to a language without such a limitation.

I'd suggest Rust or C++ as this is a CPU-bound, low-level task, but honestly Java/Kotlin wouldn't be terrible at this sort of thing.

@Ken g6 's python suggestion would be OK, but you'd need to be careful of the global interpreter lock for highly threaded computational work.

Kotlin/Java/Rust/C++ would all be decent choices as they have pretty nice functional APIs built into them. I would personally start with Java for a highly concurrent thing, as dealing with memory management can be a bit of a pain with concurrent workloads.

Here's an example of how you can do parallel processing of a regex against a bunch of strings

Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

record Result(Matcher regexMatcher, String dna) {}

public Stream<Result> evalRegex(Stream<String> dnaSequences, String regex) {
    var pattern = Pattern.compile(regex);
    return dnaSequences.parallel()
        .map((dna) -> new Result(pattern.matcher(dna), dna))  // pair each sequence with its matcher
        .filter((r) -> r.regexMatcher().matches());           // keep only full matches
}

The above code fans out the "map" function onto as many CPU cores as are available via a work-stealing queue. The stream is lazily evaluated (no work is actually performed until you run a terminal operation like `.collect()` on it), which lets you queue up all the operations you want first.
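To make the lazy-evaluation point concrete, here's a minimal runnable sketch that restates the method above inside a class, using toy sequences (real targets would be MB-scale strings). Nothing runs until the terminal `.collect()` call:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyStreamDemo {
    record Result(Matcher regexMatcher, String dna) {}

    static Stream<Result> evalRegex(Stream<String> dnaSequences, String regex) {
        var pattern = Pattern.compile(regex);
        return dnaSequences.parallel()
            .map(dna -> new Result(pattern.matcher(dna), dna))
            .filter(r -> r.regexMatcher().matches());
    }

    public static void main(String[] args) {
        // Hypothetical toy sequences, just for illustration.
        var sequences = Stream.of("GATTACA", "CCCCCCC", "GATCACA");
        // Building the pipeline is cheap and lazy; collect() triggers the parallel work.
        List<String> hits = evalRegex(sequences, "GAT[TC]ACA")
            .map(Result::dna)
            .collect(Collectors.toList());
        System.out.println(hits); // [GATTACA, GATCACA]
    }
}
```

Note that even on a parallel stream, `collect(Collectors.toList())` preserves the encounter order of the source.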

Java also has some pretty nice "Reactive" projects (like reactor) which would allow you to progressively process a file and set of regexes without holding everything in memory.

Java will be a bit more memory intense than what you'd get with Rust/C++, but it will also be nearly as fast (for concurrent stuff, likely faster).
 
Last edited:
Reactions: Nothingness

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
While "% CPU load" shows whether threads are busy or are sleeping, it does unfortunately not show how the CPU time is spent.
I agree. The point in monitoring it is two-fold: checking that usage is scaling linearly with number of parallel processes, and estimating the number of processes that would induce swapping.

Here's what I'm noticing: the processes are scaling fine, but each could be computing a lot more data.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Are you using any of the mathematics capabilities of Mathematica at this point or mainly regex processing?
In this application I am using regex to generate thousands of candidate matches which are then assayed by mathematical criteria. When written in a functional programming model, Mathematica produces an efficient computation.

At present I am focused on where to "slice" the data for parallel processing to get the best use of my present computational hardware. The input data are of size:
targets x queries x regex/query.
The standard targets are strings of length 4 MB to 40 MB; they number 8 to 32.
The standard queries are strings of length 16 to 32; they number 16 to 48.
The regex/query are of fixed length; they usually vary in number from 40 to 60, depending on the query length.

The performance charts I've recently posted are from parallelization across the regex, which includes assay of the results.
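The slicing choice described above can be sketched in Java (the thread's example language; the actual code is Mathematica, and `countHits` is a hypothetical stand-in for the per-(target, regex) work). Parallelizing the outer target loop yields a few large tasks, while parallelizing the inner regex loop yields many small tasks per target; both compute the same result:

```java
import java.util.List;

public class SlicingDemo {
    // Hypothetical stand-in for the per-(target, regex) work.
    static int countHits(String target, String regex) {
        var m = java.util.regex.Pattern.compile(regex).matcher(target);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        List<String> targets = List.of("ACGTACGT", "TTTTACGT"); // toy stand-ins for 4-40 MB strings
        List<String> regexes = List.of("ACGT", "ACG[TA]");

        // Slice across targets (outer loop): few large parallel tasks.
        long hitsOuter = targets.parallelStream()
            .mapToLong(t -> regexes.stream().mapToLong(r -> countHits(t, r)).sum())
            .sum();

        // Slice across regexes (inner loop): many small parallel tasks per target.
        long hitsInner = targets.stream()
            .mapToLong(t -> regexes.parallelStream().mapToLong(r -> countHits(t, r)).sum())
            .sum();

        System.out.println(hitsOuter + " " + hitsInner); // same total either way
    }
}
```

Which slicing wins in practice depends on task granularity versus per-task overhead, which is exactly what the posted performance charts probe.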

Here's an example of how you can do parallel processing of a regex against a bunch of strings
If speed were the only issue I would program in Assembler. It harkens back to the days when we wrote graphics in raw postscript!
 

Cogman

Lifer
Sep 19, 2000
10,283
134
106
which are then assayed by mathematical criteria.
You could split this into 2 applications, 1 which generates the candidate matches and a second that runs the mathematical criteria. That might give you the best of both worlds.

If speed were the only issue I would program in Assembler. It harkens back to the days when we wrote graphics in raw postscript!
I'd STRONGLY encourage you not to do this. Modern compilers and JITs are really good. You are not only unlikely to beat them in terms of performance, you'll likely make things slower.

There are exceptions, to be sure, but you should consider those the exception and not the rule. What compilers and JITs can do is really quite magical. So much so that it's actually hard to effectively microbenchmark. They'll pretty commonly do things like collapsing a for loop into its end result or dropping unreachable conditionals: things that would be really hard to do by hand in assembler.

You also might be pretty surprised at how ergonomic languages like Java and C++ have become. Both have made quite amazing jumps in readability.

Here's a fun toy to play with if you ever get the time.

 
Reactions: Nothingness

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,410
4,175
75
If I understand the application correctly, you're searching the 200m-ish bases of chromosome 4 for partial matches to a 22-character string? This strongly reminds me of text searching algorithms I've used for data compression.

The trick would seem to be "partial" matching. If the maximum number of errors you're looking for is 4, enumerating those partial matches doesn't seem out of the question. Implementing Rabin-Karp hashing for the data file, while the size gives me pause, seems plausible, though RAM use would be a concern. (Does Java allow addressing more than 4GB these days?) Then it would just be a matter of looking up those enumerated partial matches in the hash table.
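For readers unfamiliar with the idea, here is a minimal sketch of Rabin-Karp exact search (hash base and modulus are illustrative choices, not anything from the thread): a polynomial rolling hash lets each one-character shift of the search window be updated in O(1), with a direct comparison only on hash hits:

```java
import java.util.ArrayList;
import java.util.List;

public class RabinKarpDemo {
    // Return every index where `pattern` occurs in `text`.
    static List<Integer> search(String text, String pattern) {
        final long BASE = 256, MOD = 1_000_000_007L;
        int m = pattern.length(), n = text.length();
        List<Integer> hits = new ArrayList<>();
        if (m == 0 || n < m) return hits;

        long patHash = 0, winHash = 0, pow = 1; // pow = BASE^(m-1) mod MOD
        for (int i = 0; i < m; i++) {
            patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
            winHash = (winHash * BASE + text.charAt(i)) % MOD;
            if (i < m - 1) pow = (pow * BASE) % MOD;
        }
        for (int i = 0; ; i++) {
            // Verify on hash hit to rule out collisions.
            if (patHash == winHash && text.regionMatches(i, pattern, 0, m))
                hits.add(i);
            if (i + m >= n) break;
            // Roll the window: drop text[i], append text[i+m].
            winHash = ((winHash - text.charAt(i) * pow % MOD + MOD) * BASE
                       + text.charAt(i + m)) % MOD;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("ACGTACGTAC", "ACGT")); // [0, 4]
    }
}
```

The enumeration idea above would hash each of the pre-generated partial-match variants once, then scan the target with a single rolling window, checking each window hash against a hash table of variant hashes.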
 
Reactions: Hermitian

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
If I understand the application correctly, you're searching the 200m-ish bases of chromosome 4 for partial matches to a 22-character string?
In that example, yes - although 20M bases.
The trick would seem to be "partial" matching.
Yes. Now if these are required to be the same length as the query string (e.g. 22 characters), then this can be done quickly with regular discrete correlation, especially since it is the foundation of string searches in modern processors. However, here the "matches" are allowed to vary in length. For example, a couple of the errors might be due to missing characters in the sequence, or likewise to a few inserted characters.
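One textbook way to handle matches whose length may differ from the query (insertions and deletions as well as substitutions) is semi-global edit-distance alignment, where the match may start and end anywhere in the target. This is a toy sketch of that general technique, not the thread's actual Variable Width Discrete Correlation algorithm:

```java
public class SemiGlobalDemo {
    // Minimum edit distance between `query` and the best-matching substring
    // of `target`. Because insertions/deletions each cost 1, the matching
    // substring may be shorter or longer than the query.
    static int bestMatchDistance(String target, String query) {
        int n = target.length(), m = query.length();
        int[] prev = new int[n + 1], cur = new int[n + 1];
        // Row 0 is all zeros: the match may start at any target position.
        for (int i = 1; i <= m; i++) {
            cur[0] = i; // unmatched query prefix costs deletions
            for (int j = 1; j <= n; j++) {
                int sub = prev[j - 1] + (query.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t;
        }
        // The match may end anywhere: take the best cell of the final row.
        int best = Integer.MAX_VALUE;
        for (int j = 0; j <= n; j++) best = Math.min(best, prev[j]);
        return best;
    }

    public static void main(String[] args) {
        // "ACGT" occurs with one character deleted ("ACT") inside the target.
        System.out.println(bestMatchDistance("GGACTGG", "ACGT")); // 1
    }
}
```

Mathematica's built-in SequenceAlignment, mentioned later in the thread, solves a closely related alignment problem with similar dynamic programming.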
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Here are the results of timing tests over a large range of problem sizes.

Note how poorly my current parallelization approach performs for target (the strings to be searched) sizes below about 70 kB.

Also notice the hiccup at 900 kB. It appears to be hardware related.

 

Cogman

Lifer
Sep 19, 2000
10,283
134
106
Does Java allow addressing more than 4GB these days
Yes and no.

You can't make a byte[] that stores more than 2^31 bytes (2 GB) because of integer indexing. You can, however, store exabytes of data. Java also has a preview feature inbound called "MemorySegment" that allows for compact allocations of up to 2^63 bytes.

In the case of something like this, there's nothing that prevents you from having 2^31 "String" objects, which themselves can each contain up to 2 GB of characters.


The preview is blocked on "Project Valhalla", which is Java introducing flat objects into the JVM (very exciting for computational stuff). Basically, at the moment all objects in Java are represented as pointers to the data/fields. Valhalla is introducing "value objects", which ultimately represent objects as the fields themselves rather than as pointers to the fields.

So an array of `record Point(int x, int y){};` today looks like `[pointer -> x, y, pointer -> x, y, pointer -> x, y]`. When Valhalla lands, that changes to `value record Point(int x, int y){};`, which is represented as `[x, y, x, y, x, y]`. There's a bit more to it than that... but yeah, really exciting stuff for high-speed computing in the JVM.

I should note that the pointer problem for the JVM generally isn't a huge issue due to how the GC algorithms work. By the nature of how GC works, you can reasonably assume that your `Point[]` pointers will all end up in the same locality, which helps a ton with cache misses. However, there's still an extra lookup, plus bloat due to the object headers.

If you can't tell, I'm really excited about Valhalla. It will actually make the JVM fairly competitive with C/C++ in terms of performance.
 
Reactions: Hermitian

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
I've implemented an option to switch between 4 different "levels" of parallelism: none, by targets, by queries, or by reflexes (original). Next I'll examine how these options scale with size.

 
Jul 27, 2020
20,917
14,492
146
It's 13k for a 8P+2E cores vs 6k for an i7 with 4 cores, so I'm not sure the advantage is that large.
Things may have improved considerably since then and especially by the time he decides to get a Mac, he may be able to snag an M4 device.

I was kinda taken aback by my M1 MBA last night, scoring just 7 seconds on battery power in the overclock.net Excel benchmark sheet using the LibreOffice Calc Apple Silicon version. The Ryzen 9950X scores about 11 seconds on the first try and roughly 9 seconds on the second. Will try again tonight to see what the 2nd run time is on the M1.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
Things may have improved considerably since then and especially by the time he decides to get a Mac, he may be able to snag an M4 device.
Yes it should be quite competitive... but quite expensive too!

I was kinda taken aback by my M1 MBA last night scoring just 7 seconds on battery power in overclock.net Excel benchmark sheet using Librecalc Apple Silicon version. Ryzen 9950X scores about 11 seconds first try and roughly 9 seconds on the second try. Will try again tonight to see what the 2nd run time is on the M1.
Oh I didn't know you resurrected your Mac, nice 😀
As I already said, I'm very pleased by the performance of my MBP M1 Max. Can't wait to get an M4.
 
Reactions: igor_kavinski
Jul 27, 2020
20,917
14,492
146
Oh I didn't know you resurrected your Mac, nice 😀
Thanks to your charging tip

I think previously I was trying to charge it for 2 or 3 hours max, and that wasn't cutting it. Annoyingly, there doesn't seem to be a charging-indicator LED (or I'm too blind to see it), so that left me doubting whether it was even getting charged by my mobile quick charger.

Gonna try Wine with a few applications before deciding if I want to get a Snapdragon or an M2 laptop. Snapdragon would give me 32GB RAM so kinda leaning towards that.
 
Reactions: Nothingness

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
@Nothingness
I have finished the benchmark tests of the improved algorithm on my memory-upgraded system. Here are some takeaways.

Overall improvement in runtimes: from weeks to days.

Overall paradigm shift: prior to memory upgrade, the most efficient approach was to parallelize the innermost loop. After the memory upgrade, the most efficient approach is to parallelize the outermost loop (previously not possible). Implementation of the built-in SequenceAlignment function provided further speed-up.

Actual runtime on "standard" problem: 4.7 days for queries of 8 markers on an Apple genome library.

Estimated runtime of "wannabe" problem of 1200 markers on 12 Tea genomes: 36.6 weeks. Still a bit too long, but getting closer. 🙂
 
Reactions: Nothingness

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
@Nothingness
I have finished the benchmark tests of the improved algorithm on my memory-upgraded system. Here are some takeaways.

Overall improvement in runtimes: from weeks to days.

Overall paradigm shift: prior to memory upgrade, the most efficient approach was to parallelize the innermost loop. After the memory upgrade, the most efficient approach is to parallelize the outermost loop (previously not possible). Implementation of the built-in SequenceAlignment function provided further speed-up.

Actual runtime on "standard" problem: 4.7 days for queries of 8 markers on an Apple genome library.

Estimated runtime of "wannabe" problem of 1200 markers on 12 Tea genomes: 36.6 weeks. Still a bit too long, but getting closer. 🙂
While running the tests, did you do any system profiling? Was it effectively keeping all cores busy? How busy was RAM? Did you notice anything that appeared to be a limiting factor?
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
While running the tests, did you do any system profiling? Was it effectively keeping all cores busy? How busy was RAM? Did you notice anything that appeared to be a limiting factor?
Yes.
My hardware has 10 cores capable of two threads each. The CPU is overclocked, typically achieving 5.12 GHz under heavy loads. The board is liquid cooled.
My Windows O/S is set to prioritize applications.
I instructed Mathematica to run 12 parallel kernels, which results in 12 computational subprocesses plus 2 coordinating subprocesses. Each computational subprocess accounted for about 6% of available CPU load until its task was finished. Memory usage rarely exceeded 20% and is no longer a bottleneck; in conjunction with this, the cores ran about 4°C cooler.

However, I believe the number of memory channels (2) is still limiting runtime. Keep in mind I've been historically spoiled by supercomputers with hundreds of cores all on the same memory backplane, with 1:1 channel-to-core ratios. Additionally, the number of cores in my system and the number of parallel processes permitted by my Mathematica license inhibit faster times to completion, although the memory-channel issue restricts these gains in a less-than-linear fashion.
 