Question CPUs for shared memory parallel computing

StefanR5R · Tuesday at 8:10 AM

While "% CPU load" shows whether threads are busy or are sleeping (e.g. waiting for disk I/O; waiting on semaphores for inter-thread synchronization; ...), it does unfortunately not show how the CPU time is spent (e.g. computing; chasing pointers; trying to read from or to write to main memory, perhaps even while other threads are trying the same but at other addresses; busy-waiting on a spinlock but that seems unlikely to be relevant to your application, ...).

Ken g6 · Tuesday at 1:41 PM

Nothingness said:
For a product similar to Mathematica, look at Maple license cost. Are they overcharging too?

For people who need that kind of products, they're worth every penny spent.

@Hermetian I'm not very familiar with Mathematica, but looking at your code, I don't see anything very Mathematica-specific. Nothing about math function manipulation or integral approximations or anything like that. It just looks like string manipulations with regular expressions.

I really think you should look into doing this in Python instead. Python is a free (as in beer and speech) language which is very popular with scientists, among others. I'd say it's at least as popular with scientists as Fortran was back in the day, but much easier to learn and use. It tends to be one of the faster high-level languages, and you don't have to pay for unlimited threads. (I think that's the same as "kernels".) It also has libraries available for many things. I found one called BioPython that seems up your alley, though I don't know if it applies to this particular project.

You seem familiar with Mathematica syntax. If you're generally familiar with procedural programming languages, you might find it instructive to skim this page for an idea of Python's syntax. I didn't look deeper to see if the other resources are useful.

The main different things about Python (which I find annoying) are:
- Grouping instructions, e.g. for an if or a loop, is done by a colon followed by indenting following lines. I see Mathematica does this with [square brackets]. (Most languages do this with {curly braces}; some say "end" at the end of a block instead.) Python's method also means you can't auto-indent - since the indent is integral you have to manually keep track of it yourself.
- For loops only apply to lists, like "foreach" in other languages. You can generate a numeric list with range() to work like most languages' for loops. But I don't see a for loop in your code so you may not care.
- There is usually one canonical, "Pythonic", socially acceptable way to do any given thing you want to do. I find this restrictive and overbearing, but I guess it makes reading other people's code easier.

Nothingness · Tuesday at 3:07 PM

My experience with Python is just plain horrible from a performance point of view. I've seen so many prototypes written in Python end up in production that are so slow that they need a full rewrite (in Python or other languages).

I guess many people who find Python performance good enough don't realize that's only the case because most of the time is spent in external native libraries.

I don't think going from Mathematica to Python is the way to go to really gain speed. It might be an easier path than going lower level (C/C++/Rust/Go), but that'd be completely wasted time if the result is slower or only moderately faster.

Python has its uses (I developed scripts to generate code for our simulator based on textual specifications). But for performance, I will never recommend it.

Hermetian · Tuesday at 3:14 PM

Ken g6 said:
looking at your code, I don't see anything very Mathematica-specific. Nothing about math function manipulation.

Mathematics is the language of measurement. College calculus is a small part of its realm.

The algorithm I've implemented here is Variable Width Discrete Correlation. It is an application of signal processing to noisy data. In this case the signals are cell chromosomes.

Ken g6 said:
i really think you should look into doing this in Python instead.

Thank you for the suggestion. I'm familiar with Python and many other languages, beginning with octal and Algol. Python is inherently slow and poorly supported in terms of efficient, error-free libraries. It does support a functional programming model but it doesn't compile any better than sequential style. The best compiler I've had experience with is Cray Fortran. Wolfram is a close second, and comes in first in terms of functionality.

Cogman · Tuesday at 3:21 PM

@Hermetian
Are you using any of the mathmatics capabilities of mathmatica at this point or mainly regex processing?

I ask because if the licensing for kernel cores is starting to be a limiting factor you might be better off into a language without such limitation.

I'd suggest Rust or C++ as this is a CPU bound low level task, but honestly Java/Kotlin wouldn't be terrible at this sort of task.

@Ken g6 's python suggestion would be OK, but you'd need to be careful of the global interpreter lock for highly threaded computational work.

Kotlin/Java/Rust/C++ would all be decent choices as they have pretty nice functional APIs built into them. I would personally start with Java for a highly concurrent thing as dealing with memory management can be a bit of a pain with concurrent workload.

Here's an example of how you can do parallel processing of a regex against a bunch of strings

Java:

record Result(Matcher regexMatcher, String pattern){};

public Stream<Result> evalRegex(Stream<String> dnaSequence, String regex) {
    var pattern = Pattern.compile(regex);
    return dnaSeqence.parallel()
        .map((dna)->new Result(pattern.matcher(dna), dna))
        .filter((r)->r.regexMatcher().matches())
}

That above code will fan out the "map" function onto as many CPU cores as you have available into a work stealing queue. It lazy evaluates the stream (so you don't actually perform any actions until you run something like `.collect()` on the stream) which allows you to queue up all the operations you'd want to.

Java also has some pretty nice "Reactive" projects (like reactor) which would allow you to progressively process a file and set of regexes without holding everything in memory.

Java will be a bit more memory intense than what you'd get with Rust/C++, but it will also be nearly as fast (for concurrent stuff, likely faster).

Hermetian · Tuesday at 3:29 PM

StefanR5R said:
While "% CPU load" shows whether threads are busy or are sleeping, it does unfortunately not show how the CPU time is spent.

I agree. The point in monitoring it is two-fold: checking that usage is scaling linearly with number of parallel processes, and estimating the number of processes that would induce swapping.

Here what I'm noticing the processes are scaling fine, but each could be computing a lot more data.

Hermetian · Tuesday at 4:06 PM

Cogman said:
Are you using any of the mathematics capabilities of Mathematica at this point or mainly regex processing?

In this application I am using regex to generate thousands of candidate matches which are then assayed by mathematical criteria. When written in a functional programming model, Mathematica produces an efficient computation.

At present I am focused on where to "slice" the data for parallel processing to get the best use of my present computational hardware. The input data are of size:
targets x queries x regex/query.
The standard targets are strings of length 4MB to 40MB. They number 8 to 32.
The standard queries are strings of length 16 to 32. They number 16 to 48.
The regex/query are fixed length. They usually vary in number from 40 to 60, depending on the query length.

The performance charts I've recently posted are from parallelization across the regex, which includes assay of the results.

Cogman said:
Here's an example of how you can do parallel processing of a regex against a bunch of strings

If speed were the only issue I would program in Assembler. It harkens back to the days when we wrote graphics in raw postscript!

Cogman · Tuesday at 4:24 PM

Hermetian said:
which are then assayed by mathematical criteria.

You could split this into 2 applications, 1 which generates the candidate matches and a second that runs the mathematical criteria. That might give you the best of both worlds.

Hermetian said:
If speed were the only issue I would program in Assembler. It harkens back to the days when we wrote graphics in raw postscript!

I'd STRONGLY encourage you not to do this. Modern compilers and JITs are really good. You are not only unlikely to beat them in terms of performance, you'll likely make things slower.

There are exceptions, to be sure, but you should consider those the exception and not the rule. What compilers and JITs can do is really quiet magical. So much so that it's actually hard to effectively microbenchmark. They'll pretty commonly do things like collapsing a for loop into the end result or dropping unreachable conditionals. Things that'd be really hard to do as a programmer doing assembler.

You also might be pretty surprised at how ergonomic languages like Java and C++ have become. Both have made quiet amazing jumps in readability.

Here's a fun toy to play with if you ever get the time.

Data Analytics With Kotlin Notebooks, DataFrame, and Kandy | The Kotlin Blog

Interactively exploring and visualizing data allows users to uncover hidden patterns, trends, and relationships that may not be immediately apparent through static analysis. This instant feedback allows users to quickly validate their assumptions, identify outliers or anomalies, and adjust their...

blog.jetbrains.com

Ken g6 · Tuesday at 6:14 PM

If I understand the application correctly, you're searching the 200m-ish bases of chromosome 4 for partial matches to a 22-character string? This strongly reminds me of text searching algorithms I've used for data compression.

The trick would seem to be "partial" matching. If the maximum number of errors you're looking for is 4, enumerating those partial matches doesn't seem out of the question. Implementing Rabin Karp hashing for the data file, while the size gives me pause, seems plausible, though RAM use would be a concern. (Does Java allow addressing more than 4GB these days?) Then it would just be a matter of searching those enumerated partial matches in the hash table.

Hermetian · Tuesday at 7:59 PM

Ken g6 said:
If I understand the application correctly, you're searching the 200m-ish bases of chromosome 4 for partial matches to a 22-character string?

In that example, yes - although 20M bases.

Ken g6 said:
The trick would seem to be "partial" matching.

Yes. Now if these are required to be the same length as the query string (e g. 22 characters) then this can be quickly done with regular discrete correlation, esp. since it is the foundation of string searches in modern processors. However, here the "matches" are allowed to vary in length. For example, a couple of the errors might be due to missing characters in the sequence, or likewise due to a few inserted characters.

Hermetian · Tuesday at 8:08 PM

Here's the results of timing tests over a large range of problem sizes.

Note how poorly my current parallelization approach is for target (the strings to be searched) sizes less than about 70kB.

Also notice the hiccup that occurs at 900kB. It appears to be hardware related.

Cogman · 2024-09-18T12:00:11-0400

Ken g6 said:
Does Java allow addressing more than 4GB these days

Yes and no.

You can't make a byte[] that stores more than 2^31 bytes (2GB) because of integer indexing. You can, however, store exabytes of data. Java also has a preview feature inbound called "MemorySegment" that allows for up to 2^63 bit compact allocations.

In the case of something like this, there's nothing that prevents you from having 2^31 "String" objects which themselves can contain up to 2gb of characters.

MemorySegment (Java SE 21 & JDK 21)

declaration: module: java.base, package: java.lang.foreign, interface: MemorySegment

docs.oracle.com

The preview is blocked on "Project Valhalla" which is java introducing flat objects into the JVM (Very exciting for computational stuff). Basically, at the moment all objects in java are represented as pointers to the data/fields. Valhalla is introducing "value objects" which ultimately represent objects as the fields themselves rather than pointers to the field.

So an array of `record Point(int x, int y){};` today looks like `[pointer -> x, y, pointer -> x, y, pointer ->x, y]` When valhalla hits that changes to `value record Point(int x, int y){};` which is represented as `[x,y, x, y, x, y]`. There's a bit more to it than that... but yeah, really exciting stuff for high speed computing in the JVM.

I should note that the pointer problem for the JVM generally isn't a huge issue due to how the GC algorithms work. By the nature of how GC works, you can reasonably assume that your `Point[]` pointers will all end up in the same locality, which helps a ton with cache misses. However, there's still and extra lookup and bloat due to the object headers.

If you can't tell, I'm really excited about Valhalla. It will actually make the JVM fairly competitive with C/C++ in terms of performance.

Question CPUs for shared memory parallel computing

StefanR5R

Elite Member

Ken g6

Programming Moderator, Elite Member

Nothingness

Diamond Member

Hermetian

Member

Cogman

Lifer

Hermetian

Member

Hermetian

Member

Cogman

Lifer

Data Analytics With Kotlin Notebooks, DataFrame, and Kandy | The Kotlin Blog

Ken g6

Programming Moderator, Elite Member

Hermetian

Member

Hermetian

Member

Cogman

Lifer

MemorySegment (Java SE 21 & JDK 21)

TRENDING THREADS