Discussion Rudi_Float_Bench v0.02a

Jul 27, 2020
23,075
16,243
146
Now if we can talk @Det0x to do it on his 9950x !
No way his CPU is beating yours. I would be seriously surprised if he can

I mean it's basically 32 threads at roughly 5.7 GHz. There is a slight chance but I haven't checked how this benchmark scales with SMT so that's an unknown. All I can think of is that he has 16 real cores vs. 64 of your cores. That's a crazy comparison.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,755
15,789
136
No way his CPU is beating yours. I would be seriously surprised if he can

I mean it's basically 32 threads at roughly 5.7 GHz. There is a slight chance but I haven't checked how this benchmark scales with SMT so that's an unknown. All I can think of is that he has 16 real cores vs. 64 of your cores. That's a crazy comparison.
SMT OFF for the last run of 2308. See the picture.
 
Reactions: igor_kavinski
Jul 27, 2020
23,075
16,243
146
SMT OFF for the last run of 2308. See the picture.
My guess is that on Zen 2 and Zen 5 server CPUs, the FPU units get clobbered with too much context switching (pausing the real thread, switching to virtual thread and so on, for who knows how many times per millisecond) which decreases their throughput with SMT on.

Need to investigate the SMT/HT impact of other x86 CPU models/architectures to see if there is some design that manages a net benefit with SMT on.

So I'm betting that Det0x will also get his best score with SMT off.
 
Jul 27, 2020
23,075
16,243
146
Overclock.net user Veii's 285K result:



Fun fact: Those asterisks and pipe characters? Those are the number of real cores in the CPU. Unfortunately, there's no plan to display HT/SMT virtual cores; that's not gonna be a trivial task. All threads ARE recognized by the benchmark, though, including SMT/HT threads.

User SizzlinChips' 285K result:


Seems it's tricky to get consistent performance out of Arrow Lake, even with microbenchmarks.
 
Last edited:

Det0x

Golden Member
Sep 11, 2014
1,412
4,784
136
Now if we can talk @Det0x to do it on his 9950x !

There is a slight chance but I haven't checked how this benchmark scales with SMT so that's an unknown. All I can think of is that he has 16 real cores vs. 64 of your cores. That's a crazy comparison.

My guess is that on Zen 2 and Zen 5 server CPUs, the FPU units get clobbered with too much context switching (pausing the real thread, switching to virtual thread and so on, for who knows how many times per millisecond) which decreases their throughput with SMT on.

Need to investigate the SMT/HT impact of other x86 CPU models/architectures to see if there is some design that manages a net benefit with SMT on.

So I'm betting that Det0x will also get his best score with SMT off.
Kinda strange benchmark, i'm seeing almost 100% thread scaling on 16 core Zen5 with SMT enabled/disabled

16/16 = 1031 points


16/32 = 1991
 
Reactions: igor_kavinski

LightningZ71

Platinum Member
Mar 10, 2017
2,065
2,504
136
What's the memory footprint of this bench? Usually, when I see a task get almost no benefit from SMT, it's because the cache footprint of two instances of the thread is too big for the cache level that comfortably holds one instance. For example, if one thread has a footprint of 192KB on a processor with a 256KB L2, one thread will run completely out of L2. When the second thread starts on the core with SMT enabled, there's suddenly 384KB of data trying to fit in 256KB of cache, so the core will constantly be swapping data in and out of L2. When a task really hurts on older cores with smaller L2 but flies on newer ones with larger L2, that's often confirmation. The same effect can be seen on tasks that are small enough to live in L1 alone, but spill with two copies.
 

Schmide

Diamond Member
Mar 7, 2002
5,635
832
126
Visual Studio 2022 is apparently putting in an alternate AVX2 codepath. Seems they wised up and did the right thing, for once.

The actual hot loop code:

View attachment 119235

An issue here? The loop is polling the high-resolution clock, which generally has a period of std::ratio<1,1000000000>. So at best it can operate at 1 GHz, while your loop can queue operations at a higher rate. Moreover, reading the clock is most likely a fenced memory operation, meaning it runs and reads from outside the cache.

Maybe have a subloop that iterates at a higher rate than the clock check.

i.e.

Code:
volatile float va = 1.0f; // should prevent optimizing the loop 
__m512 a = _mm512_set1_ps(va);
__m512 b = _mm512_set1_ps(0.0f);
int subLoop = 100;
do {
   b = _mm512_add_ps(b, a);
} while (--subLoop);
va = *reinterpret_cast<float *>(&b);

// then check clock
 
Reactions: igor_kavinski
Jul 27, 2020
23,075
16,243
146
Maybe have a subloop that iterates at a higher rate than the clock check.
Thanks! Will run some tests to see the difference.

Do you think there is some issue in the following code that is preventing the benchmark thread from getting loaded onto cores >32?

C++:
 for (DWORD i = 0; i < sysInfo.dwNumberOfProcessors; ++i) {
     threads.emplace_back(BenchmarkThread, std::ref(totalOps));
 }
 

Nothingness

Diamond Member
Jul 3, 2013
3,239
2,293
136
Do you think there is some issue in the following code that is preventing the benchmark thread from getting loaded onto cores >32?

C++:
 for (DWORD i = 0; i < sysInfo.dwNumberOfProcessors; ++i) {
     threads.emplace_back(BenchmarkThread, std::ref(totalOps));
 }
Random thought from someone who has no Windows programming knowledge: does that sysinfo structure work for systems with multiple sockets?
 

Schmide

Diamond Member
Mar 7, 2002
5,635
832
126
Thanks! Will run some tests to see the difference.

Do you think there is some issue in the following code that is preventing the benchmark thread from getting loaded onto cores >32?

C++:
 for (DWORD i = 0; i < sysInfo.dwNumberOfProcessors; ++i) {
     threads.emplace_back(BenchmarkThread, std::ref(totalOps));
 }

If you're threading and counting operations, you need to use an atomic_ref (or an atomic int).

Good example https://en.cppreference.com/w/cpp/atomic/atomic

Edit: note that using atomics will often flush certain cache areas and affect total throughput. So in general, do a fair amount of work, then do the atomic sum.
 
Jul 27, 2020
23,075
16,243
146
atomic is being used: "void BenchmarkThread(std::atomic<uint64_t>& totalOps)"

What I don't understand is, why is there an invisible limit that causes the benchmark thread to refuse to go above 64 threads?

See below:




EDIT: typo. Wrote 32 instead of 64 threads. The limit seems to be 64 threads.
 
Last edited:
Jul 27, 2020
23,075
16,243
146
Maybe have a subloop that iterates at a higher rate than the clock check.
Sadly, the score dropped to below 300 on the 48-thread Xeon

Looks like I gotta leave v0.01 as it is and try to change the approach for the next version (maybe a direct comparison between AVX2 and AVX-512 to see how much speedup AVX-512 gives).
 
Jul 27, 2020
23,075
16,243
146
Maybe have a subloop that iterates at a higher rate than the clock check.
You were right!

The mistake I made was failing to increment the ops in the subloop.

Performance increased four times!

Just going to run a few more tests before dropping the AVX-512-only binary.
 

MS_AT

Senior member
Jul 15, 2024
525
1,107
96
An issue here? The loop is polling the high-resolution clock, which generally has a period of std::ratio<1,1000000000>. So at best it can operate at 1 GHz, while your loop can queue operations at a higher rate. Moreover, reading the clock is most likely a fenced memory operation, meaning it runs and reads from outside the cache.

Maybe have a subloop that iterates at a higher rate than the clock check.

i.e.

Code:
volatile float va = 1.0f; // should prevent optimizing the loop 
__m512 a = _mm512_set1_ps(va);
__m512 b = _mm512_set1_ps(0.0f);
int subLoop = 100;
do {
   b = _mm512_add_ps(b, a);
} while (--subLoop);
va = *reinterpret_cast<float *>(&b);

// then check clock
I would double-check with the disassembly. You can use https://godbolt.org/ for that, if disassembling the actual binary would prove problematic.

I mean, since the loop does not touch the volatile variable, the compiler should be able to replace the loop with a constant, unless the reinterpret_cast is confusing it.

And if the loop is not optimized away, the loop-carried dependency will not allow hitting optimal performance, since every iteration depends on the result of the previous one. Most CPUs are able to execute 2 FADDs per cycle, so the second unit will not be used in parallel. Introducing another accumulator could help with this. Then you could try to account for add latency, but since I guess the benchmark is trying to be generic, that would be going too far.

The simplest way to ensure the compiler will not optimize the loop away is to store the accumulators to memory, ideally outside the loop. Load the initial value from memory, also outside the loop, so the compiler won't be able to assume anything about the data.
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,635
832
126
I would double-check with the disassembly. You can use https://godbolt.org/ for that, if disassembling the actual binary would prove problematic.

I mean, since the loop does not touch the volatile variable, the compiler should be able to replace the loop with a constant, unless the reinterpret_cast is confusing it.

And if the loop is not optimized away, the compiler won't be able to unroll it due to the loop-carried dependency. Since most CPUs are able to execute 2 FADDs per cycle, you are leaving performance on the table. Introducing another accumulator could help with this. Then you could try to account for add latency, but since I guess the benchmark is trying to be generic, that would be going too far.

The simplest way to ensure the compiler will not optimize the loop away is to store the accumulators to memory, ideally outside the loop. Load the initial value from memory, also outside the loop, so the compiler won't be able to assume anything about the data.

The loop depends on the volatile, so the compiler is unable to reduce it without treating the volatile as a constant.

Compilers generally will not reduce intrinsics as they are explicit instructions.

https://godbolt.org/z/YzsE7zhhM

gcc unrolls it twice. clang unrolls it 10 times.

The loop is a chain. Each instruction is dependent on the previous one, so there is no way to eke out more performance.

It seems Godbolt doesn't execute AVX-512
 
Reactions: MS_AT

MS_AT

Senior member
Jul 15, 2024
525
1,107
96
The loop depends on the volatile, so the compiler is unable to reduce it without treating the volatile as a constant.

Compilers generally will not reduce intrinsics as they are explicit instructions.

https://godbolt.org/z/YzsE7zhhM

gcc unrolls it twice. clang unrolls it 10 times.

The loop is a chain. Each instruction is dependent on the previous one, so there is no way to eke out more performance.

It seems Godbolt doesn't execute AVX-512
Yup, my bad, writing after hiking is apparently a bad idea. What I wanted to say is that by introducing another accumulator you could eke out more performance, since right now your code reads and writes zmm0 every cycle. Since this is read-after-write, the OoO engine cannot do anything about it. If you lowered the loop count to 50 and introduced a c variable (handled exactly like b), it would use more registers. And since you usually have more than one add unit, it would do the adds in parallel.
I have edited my post for clarity. I hope I have not messed anything up again.
 
Reactions: Schmide