Question CPUs for shared memory parallel computing


MS_AT

Senior member
Jul 15, 2024
365
798
96
A Google search will provide suitable answers.
The kernel uses highly optimized multithreaded libraries such as Intel MKL and IPP, which are tuned for optimal performance and take advantage of advanced CPU features when available. This is important for vectorized machine arithmetic and numerical linear algebra (BLAS, …) routines, which are fundamental building blocks for many computational tasks.
after https://support.wolfram.com/39353?src=mathematica under Central Processing Unit.

The wording as such suggests that no SIMD is disabled. Some Stack Overflow answers seem to suggest that the MKL workaround for AMD CPUs might still be needed, but those answers are quite old, so they might not be applicable anymore.

Based on my limited search I was unable to find any information that would suggest AVX is disabled.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
If you could provide a simple example of one whole calculation, I think I could whip up a quick C program, because this looks like text processing more than anything else, unless some indexed lookups are involved too?
You need not write any code. A number of "DNA analysis" benchmarks are available. Additionally, Mathematica has a benchmark suite which just might run under the trial version.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
The post-processing run time is less than 20 minutes, sometimes less than 3. It involves the assimilation of hundreds of files, vetting the contents, and then writing a single output file. I have a high-end SSD, plus I'm using a binary file format so the I/O portion is under a second.
Thanks for the details. So your problem is mostly around regexp matching. I should read more about bioinformatics before thinking further about the issues at hand. Are there any freely available resources I could read to get a better grasp of what you're doing?

Of course one could write this code in a number of other languages for compilation to a binary executable. It would require construction or acquisition of compiler-compatible libraries of set-theoretic functions; i.e. a lot of software management overhead.
Yes, moving away from a high-level language such as Wolfram Language and its tuned libraries would be an enormous task that would require a lot of expertise in many algorithmic fields.

On the other hand, writing highly optimized code dedicated to a specific task can do wonders. But that requires many qualified people, a luxury you likely don't have.

The regular expressions are not complex. In fact I initially tried using a single, complex regex per marker but rapidly found out that the recursion limit will be exceeded on chromosome size strings. Consequently I replaced the complex regex with many individual cases. Soon after that I realized that parallelization would be necessary.
There's no writable data shared during the parallel phase, right? And many regexps apply to the same genome data?

Sorry if I ask basic questions, I've long been interested in Wolfram Language (for more than 30 years) and bioinformatics in general without ever finding time to dig into the subjects.
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
The amount of RAM allocated for your data structures is not necessarily the amount of live data that your program has in flight at any given moment. Extra RAM would allow more instances and would help buffer disk I/O operations, so it won't likely hurt.

In general, you don't seem to have a firm handle on where you are performance limited with your current setup. Windows Task Manager is very general in what it tells you. It'll tell you that your processor is at near 100% utilization even when it is spending over 70% of its cycles stalled on memory I/O. We don't know if your CPU is blowing through its computation work in moments only to turn around and wait for data to be written to cache or fetched from main memory, or if it's waiting on another core to finish an operation. Before you start spending gobs of money on upgrades or a whole new system, I would spend some quality time with Windows PerfMon profiling your processes and problem runs.

Start with a smaller example dataset on a single core and time a run. Add a core and do it again. Repeat. Find out how it scales and see if you hit a wall in performance scaling before you run out of cores. You may be saturating your L3 cache after only a few cores, or you may be overloading your memory controller quickly. There's a lot that mathematica can be doing behind the scenes that will impact things beyond the theoretical program model itself.
 
Jul 27, 2020
20,917
14,491
146
This actually happens when the users are too focused on doing their job. One of the funniest examples was a bunch of guys in the Telex dept of my organization who used to open huge text files in Notepad, then do a Replace on them and sit back and watch as Notepad replaced the occurrences of the text string one by one from top to bottom. Some had old PCs so it would take anywhere from 30 seconds to a minute or more for the process to finish. And they did this multiple times a day. When I saw them doing this, I told them to open the text file in Wordpad and do the replace. It finished in the blink of an eye. Now of course, I wouldn't have known that if I hadn't noticed the speed difference myself. That's why when trying to solve problems, I evaluate multiple possibilities because settling on one or the first solution may lead to wasting tons of time in the long run.
 
Reactions: Tlh97

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
This actually happens when the users are too focused on doing their job. One of the funniest examples was a bunch of guys in the Telex dept of my organization who used to open huge text files in Notepad, then do a Replace on them and sit back and watch as Notepad replaced the occurrences of the text string one by one from top to bottom. Some had old PCs so it would take anywhere from 30 seconds to a minute or more for the process to finish. And they did this multiple times a day. When I saw them doing this, I told them to open the text file in Wordpad and do the replace. It finished in the blink of an eye. Now of course, I wouldn't have known that if I hadn't noticed the speed difference myself. That's why when trying to solve problems, I evaluate multiple possibilities because settling on one or the first solution may lead to wasting tons of time in the long run.
Real men do it with sed. And they have fabs!
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Are there any freely available resources I could read to get a better grasp of what you're doing?

There's no writable data shared during the parallel phase, right? And many regexps apply to the same genome data?

I've long been interested in Wolfram Language.
Bioinformatics is a very general term, I believe you are asking about genomics. The facts in genomics have significantly changed in the past 5 years with the introduction of the PacBio Sequel series of long read sequencers. The situation is akin to the Michelson–Morley experiment in physics.

Also, biologists have been living in mathematical denial since the 1940s in their analysis of "alleles", and especially in the analysis of PCR assays of molecular markers in celled organisms (eukaryotes). Consequently I'm hesitant to recommend anything beyond the basic definitions of "chromosome" and "DNA" found at https://www.genome.gov/genetics-glossary/.

The formal abbreviation for a regular expression is "regex".

I'm not sure what you mean by "writable data". The results (and annotation) of each regex search are written to a modest (1 kB to 100 kB) file. As mentioned above, these rarely occur in the same second of execution across all kernels.

I've been programming with the Wolfram language for 5 years. It is likely the last of dozens I will use. I like to describe it as a cross between Fortran and Lisp. I suspect you are familiar with "pipe" constructs in unix/linux shell commands. Something akin to this is present in the Wolfram language for "piping" functions together, possibly borrowed from Scheme. For more information, read about Wolfram "pure functions" (the "&" operator) plus their use with the map operator "/@".
 
Last edited:

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
... Start with a smaller example dataset on a single core and time a run. Add a core and do it again.
Last year I made a thorough study of how run time scales with problem sizes, parallelism sizes 0 to 14, and parallelization approaches. I plan to repeat it after the memory update in a few weeks. Thank you for the pointer to PerfMon.
 
Reactions: igor_kavinski
Jul 27, 2020
20,917
14,491
146
PacBio Sequel series of long read sequencers.
Once you mentioned that, I immediately thought of GPUs and sure enough: https://www.nvidia.com/en-us/case-studies/long-read-sequencing/

It's possible that instead of a whole system upgrade, all you need is a decent GPU with enough RAM (the RX 7900 XT has 20GB), and it could be worth a hundred CPU cores in performance and sequence-crunching power. You would need to learn a bit of a GPGPU language, but since you seem to be pretty well versed in quite a few languages already, it should be a piece of cake for you. It would also help you break free from any commercial license limitations, so hopefully your true productivity will finally be unlocked instead of being stuck waiting on processing results.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
This afternoon I took a look at my hardware specs in preparation for the memory upgrade. With much chagrin I find it is my GPU that has 16 GB and the CPU has 32 GB! Regardless I've scheduled an upgrade to 128 GB of Kingston DDR4-2933.
 

Cogman

Lifer
Sep 19, 2000
10,283
134
106
Just some thoughts probably not entirely helpful for the CPU discussion (but I do this sort of stuff for my day job).
  • Regex compilation is quite expensive; if you haven't already, you should make sure you aren't recompiling the regexes over and over again. They should ideally be cached and reused.
  • Regexes ultimately translate into finite state machines. They are a nice shortcut, but if you need hard logic you can often go faster by ditching the regex and hand-coding the state machine. This also gives the compiler more opportunities to optimize your implementation. Regexes are somewhat opaque boxes for compilers.
  • Strings, particularly UTF strings, take a lot of space and can be quite slow to work with. Considering the pattern you are working with includes only 4 unique characters, and memory is probably a bottleneck here, you can significantly reduce the footprint by encoding the sequence as 2-bit blocks (warning: don't use `boolean`, you'll want an actual bit field or some bit splatting). Consider that even in the best case, you are talking about 8 bits to represent a single DNA chemical. That means you can stuff 4 bases into the space of a single character.
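The 2-bit packing in that last bullet can be sketched as follows (JavaScript, with an arbitrary low-bits-first packing order; the function names are mine):

```javascript
// Sketch of 2-bit packing: 4 nucleotides per byte.
const CODE = { A: 0, C: 1, G: 2, T: 3 };
const BASE = "ACGT";

function pack(seq) {
  const out = new Uint8Array(Math.ceil(seq.length / 4));
  for (let i = 0; i < seq.length; i++) {
    // Byte i >> 2 holds nucleotide i in the 2-bit slot (i & 3) * 2.
    out[i >> 2] |= CODE[seq[i]] << ((i & 3) * 2);
  }
  return out;
}

function unpack(bytes, length) {
  let seq = "";
  for (let i = 0; i < length; i++) {
    seq += BASE[(bytes[i >> 2] >> ((i & 3) * 2)) & 3];
  }
  return seq;
}
```

A 28-million-letter chromosome then fits in about 7 MB instead of 28 MB (or 56 MB for UTF-16 strings), which matters for cache behavior.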

I say some of this being unfamiliar with how Mathematica optimizes stuff, so certainly take it with a grain of salt. And my feelings aren't hurt if you ditch this advice, because working code is better than hours spent writing non-working code.

But speaking of regexes being opaque boxes: imagine the regex `/AB/`. It ultimately translates into a state machine with 3 states: A, B, found. That's 3 different nodes (which means 3 different memory locations) the CPU has to navigate to correctly evaluate the regex. AFAIK, most regex compilers will not optimize beyond building the state machine.

However, if instead your code looks something like `if (sequence.contains("AB"))`, that opens up the compiler to make wonderful and aggressive optimizations where it just needs to visit the characters in the string and has no second memory location to load (except for the instruction cache). A good test to run would be converting some of the simpler regexes into more direct searching logic. Just be cautious, as some languages like to tuck regexes in places they shouldn't be (Java, for example, will compile a regex when you say `"foo, bar".split(", ")`, which is quite annoying).
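A quick illustration of the two approaches (function names are mine): for a literal pattern both return the same positions, and the `indexOf` variant restarts at `i + 1`, so it can also report overlapping hits.

```javascript
// Literal search via a compiled regex: exec with the "g" flag walks the string.
function findAllRegex(text, literal) {
  const re = new RegExp(literal, "g");
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) hits.push(m.index);
  return hits;
}

// The same search with direct indexOf: no regex machinery at all.
function findAllIndexOf(text, literal) {
  const hits = [];
  let i = text.indexOf(literal);
  while (i !== -1) {
    hits.push(i);
    i = text.indexOf(literal, i + 1); // i + 1 also permits overlapping hits
  }
  return hits;
}
```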

And speaking of the bit encoding, here's an article about just that https://medium.com/analytics-vidhya/bioinformatics-2-bit-encoding-for-dna-sequences-9b93636e90e2
 
Last edited:

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
Just some thoughts ...
Some very good thoughts!

Here's an example regex string from my codes
CAA.{17,17}AG

There is always a prefix of 1 to 4 letters and the same is true of suffixes. The range of wild cards "w" will vary from about 15 to 25, but the interval provided in the regex string is always of the form {w,w}. Overlaps are permitted in the search.
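One way to get overlapping hits from a fixed-width regex of that form, sketched in JavaScript (the manual `lastIndex` reset is the point; the function name is mine):

```javascript
// Overlapping matches of a fixed-width pattern such as CAA.{17,17}AG.
// RegExp#exec normally resumes after the whole match; resetting lastIndex
// to one past the match start lets overlapping candidates be found too.
function findOverlapping(text, pattern) {
  const re = new RegExp(pattern, "g");
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    hits.push(m.index);
    re.lastIndex = m.index + 1; // step one letter, not past the whole match
  }
  return hits;
}
```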

I'm not sure if it is still true, but 20 years ago compilers for vectorized CPUs performed string searches by vectorized discrete correlation. I've considered doing this manually, for example with the kernel
CAAXXXXXXXXXXXXXXXXAG
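A scalar sketch of that correlation idea (plain JavaScript rather than vector code; `X` marks a don't-care position, and the exact-match test could be relaxed to a score threshold to tolerate letter errors):

```javascript
// Search by correlation: slide the kernel along the sequence and count
// matching non-wildcard letters; a full score at an offset is a hit.
function correlate(seq, kernel) {
  const fixed = [];
  for (let j = 0; j < kernel.length; j++) {
    if (kernel[j] !== "X") fixed.push(j); // only non-wildcard positions score
  }
  const hits = [];
  for (let i = 0; i + kernel.length <= seq.length; i++) {
    let score = 0;
    for (const j of fixed) {
      if (seq[i + j] === kernel[j]) score++;
    }
    if (score === fixed.length) hits.push(i); // relax to score >= fixed.length - E for fuzzy matching
  }
  return hits;
}
```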

Another option is to transform characters to integers and use Wolfram's precompiled signal processing function.

It's true that Wolfram supports 8-bit character and integer types but calls to the built-in functions are strongly typed to the more conventional forms.
 
Last edited:

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
Here's an example regex string from my codes
CAA.{17,17}AG

The given example is a rather straightforward fixed-length regex. This got me thinking: isn't there a better way to search directly inside the bit-encoded form, as given in the link above:
  • A => 00
  • C => 01
  • G => 10
  • T => 11

Wrote some extremely naive code to do the latter in JavaScript, using a Uint8Array to store the raw bits. Here is what it does:
  • Converts the given DNA string to the aforementioned bit-encoded form in a Uint8Array
  • A function searchDNA takes a prefix ('CAA'), a suffix ('AG'), and a wildcard length (17 nucleotides) as input and returns all the matched wildcard indices as an array
  • Another function extractWildcards extracts the content in human-readable form

(full code to the left, press "Run" to execute)

JavaScript:
// Example usage
const dnaString = "AACAAAAAAAAAAAAAAAAAAAAGG";
const bitEncodedDNA = encodeDNA(dnaString);
const decodedDNA = decodeDNA(bitEncodedDNA);
console.log('Decoded DNA Sequence:', decodedDNA);

const patternPrefix = 'CAA';
const patternSuffix = 'AG';
const wildcardLength = 17;  // 17 nucleotides 34 bits

const matches = searchDNA(bitEncodedDNA, patternPrefix, patternSuffix, wildcardLength);
console.log("Matches found at nucleotide positions:", matches);

const wildcards = extractWildcards(bitEncodedDNA, matches, patternPrefix.length, wildcardLength);

which outputs:

Code:
Decoded DNA Sequence: AACAAAAAAAAAAAAAAAAAAAAGGAAA
Matches found at nucleotide positions: [ 2 ]
Wildcard sequences found: [ 'AAAAAAAAAAAAAAAAA' ]

This is the "equivalent" Wolfram code ChatGPT spits out. As generated it failed with a conversion error: ReplaceAll (`/.`) does not rewrite characters inside a string, so `sequence /. nucleotideToBit` returns the string unchanged and the later `FromDigits[..., 2]` fails; using `StringReplace` in those spots fixes that. It's also extremely far from "pure functional" code, but perhaps gets the point across.
Code:
(* Define the bit patterns for each nucleotide *)
nucleotideToBit = <|"A" -> "00", "C" -> "01", "G" -> "10", "T" -> "11"|>;
bitToNucleotide = <|"00" -> "A", "01" -> "C", "10" -> "G", "11" -> "T"|>;

(* Function to encode a DNA sequence into a ByteArray *)
encodeDNA[sequence_String] := Module[{bitString, paddedBitString, byteList},
  bitString = StringReplace[sequence, Normal[nucleotideToBit]]; (* /. cannot rewrite inside a string; StringReplace can *)
 
  (* Pad the bitString to make it a multiple of 8 bits (if necessary) *)
  paddedBitString = StringPadRight[bitString, Ceiling[StringLength[bitString]/8]*8, "0"];
 
  (* Convert the padded bit string to a list of bytes *)
  byteList = FromDigits[#, 2] & /@ StringPartition[paddedBitString, 8];
 
  (* Return the ByteArray *)
  ByteArray[byteList]
]

(* Function to decode a ByteArray back into a DNA sequence *)
decodeDNA[byteArray_ByteArray] := Module[{bitString, nucleotideList, decodedSequence},
  bitString = StringJoin[IntegerString[#, 2, 8] & /@ Normal[byteArray]];
  nucleotideList = StringPartition[bitString, 2];
  decodedSequence = StringJoin[nucleotideList /. bitToNucleotide];
  decodedSequence
]

(* Function to convert a DNA string pattern to its bitwise integer representation *)
dnaPatternToBits[pattern_String] := StringReplace[pattern, Normal[nucleotideToBit]]

(* Function to match bits in the ByteArray with the pattern *)
matchBits[byteArray_ByteArray, startBit_, patternBits_String] := Module[
  {bitString, bitSegment},
  bitString = StringJoin[IntegerString[#, 2, 8] & /@ Normal[byteArray]];
  bitSegment = StringTake[bitString, {startBit + 1, startBit + StringLength[patternBits]}];
  bitSegment === patternBits
]

(* Function to search for the pattern using optimized bitwise comparison *)
searchDNA[byteArray_ByteArray, patternPrefix_String, patternSuffix_String, wildcardNucleotides_] := Module[
  {bitLength, prefixBits, suffixBits, prefixLength, suffixLength, wildcardLength, matches = {}, i},
  bitLength = Length[byteArray] * 8;
  prefixBits = dnaPatternToBits[patternPrefix];
  suffixBits = dnaPatternToBits[patternSuffix];
  prefixLength = StringLength[prefixBits];
  suffixLength = StringLength[suffixBits];
  wildcardLength = wildcardNucleotides * 2; (* Convert nucleotides to bits *)

  For[i = 0, i <= bitLength - (prefixLength + wildcardLength + suffixLength), i += 2,
   If[
    matchBits[byteArray, i, prefixBits] &&
    matchBits[byteArray, i + prefixLength + wildcardLength, suffixBits],
    AppendTo[matches, i/2] (* Convert bit position to nucleotide position *)
   ]
  ];
  matches
]

(* Function to extract the actual wildcard strings found between the prefix and suffix *)
extractWildcards[byteArray_ByteArray, matches_List, prefixLength_Integer, wildcardNucleotides_Integer] := Module[
  {wildcards = {}, wildcardLength, i, match, bitString, wildcardBits, nucleotideList, wildcardDNA},
  wildcardLength = wildcardNucleotides * 2; (* Convert nucleotides to bits *)

  For[i = 1, i <= Length[matches], i++,
   match = matches[[i]];
   bitString = StringJoin[IntegerString[#, 2, 8] & /@ Normal[byteArray]];
   wildcardBits = StringTake[bitString, {match * 2 + prefixLength + 1, match * 2 + prefixLength + wildcardLength}];
 
   (* Convert the bit string back to a DNA string *)
   nucleotideList = StringPartition[wildcardBits, 2];
   wildcardDNA = StringJoin[nucleotideList /. bitToNucleotide];
 
   AppendTo[wildcards, wildcardDNA];
  ];
 
  wildcards
]

(* Example usage *)
dnaString = "AACAAAAAAAAAAAAAAAAAAAAGG";
bitEncodedDNA = encodeDNA[dnaString];
decodedDNA = decodeDNA[bitEncodedDNA];
Print["Decoded DNA Sequence: ", decodedDNA];

patternPrefix = "CAA";
patternSuffix = "AG";
wildcardNucleotides = 17;

matches = searchDNA[bitEncodedDNA, patternPrefix, patternSuffix, wildcardNucleotides];
Print["Matches found at nucleotide positions: ", matches];

wildcards = extractWildcards[bitEncodedDNA, matches, StringLength[patternPrefix] * 2, wildcardNucleotides];
Print["Wildcard sequences found: ", wildcards];

Naive and stupid, but perhaps gets the point across.
I'm also sure there are multiple ways to do this with bitmasks using some fancy AVX-512 instructions.
 

Hermitian

Member
Sep 1, 2024
95
63
46
frostconcepts.org
@Gideon , @Nothingness

Speeding up the regex search is helpful.

The purpose of the searches is to find "candidates" that closely match the "marker". For example, the regex above is one of several possible forms of the marker
CAAATTCGACACCACCTATCAG

"Closely" means within a given number of "letter errors" and "width allowance". Letter errors (E) refers to mismatched letters, e.g. one of the T's is something else. Width allowance (wA) refers to added (inserted) or deleted letters. This parameter can be positive or negative. The nature of PCR dictates a bound on E of
0 <= E <= max(4, 20% of marker length).

The width allowance is bounded by
-E <= wA <= +E.

The fixed-length regexes I generate encompass all values of E and wA for a given marker. In the cases of wA >= 0, the regex must begin and end with the first and last letters of the marker. In the cases of wA < 0, it is possible the beginning or ending letters are not present. In all cases, letters considered "present" must occur in marker order. The regexes do not enforce this criterion.

A typical search of a chromosome of 20-30 million letters will return on the order of 10k results per regex. Immediately following the search, each candidate is "annotated" to ensure the ordering criterion. Those which pass are written (with annotation) to a file named for the specimen, chromosome, marker, E, wA, and regex parameters. The others are discarded along with the entire array of candidates.
 
Last edited:

Cogman

Lifer
Sep 19, 2000
10,283
134
106
I'm not sure if it is still true, but 20 years ago compilers for vectorized CPUs performed String searches by vectorized discrete correlation.
Well, looks like I'm behind on where regex compilers are today.

I knew that the V8 regex compiler ultimately compiles regexes down to machine code. What I was unaware of is that PCRE does as well (https://github.com/PCRE2Project/pcre2/issues/263), and it is one of the more common regex engines for many languages (not sure if this is true for Mathematica).

So it is possible that your regexes are getting fully optimized. However, realize that the more complex a regex is, the more likely the engine is to bail on aggressive optimizations. Most JITs look for simple optimizations at the method boundary and bail out if the method does something they aren't expecting. For regexes it's the same: a simple search is most likely to be optimized, while anything like backtracking will likely drop it onto the slow path.

The point still stands about making sure Mathematica isn't recompiling the regexes any more than necessary. The more you can cache them, the faster things will be.
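A minimal sketch of such a cache in JavaScript (names are mine; the point is simply that each pattern string is compiled to a RegExp once and the object reused):

```javascript
// Compiled-regex cache: compile each (pattern, flags) pair at most once.
const regexCache = new Map();

function cachedRegex(pattern, flags = "g") {
  const key = pattern + "\u0000" + flags; // separator cannot occur in flags
  let re = regexCache.get(key);
  if (re === undefined) {
    re = new RegExp(pattern, flags);
    regexCache.set(key, re);
  }
  return re;
}
```

One caveat with caching a global (`g`) regex: its `lastIndex` persists between uses, so reset it before reusing the object on a new string.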
 
Reactions: Hermitian

cellarnoise

Senior member
Mar 22, 2017
758
411
136
If there is a trial version and example software to test, you could post it in this sub-forum. Users here have many different CPU and RAM configurations and would likely be willing to run it on various hardware.

Other options, besides regular commercial cloud CPUs, where you might be able to get run-time stats from specific CPU/RAM combinations, include places like:



I wonder if a large L3 cache would speed up this kind of task some, or a bunch? Either an AMD X3D part or an Intel CPU with a large unified cache? Might require a few threads per task?
 