Actual vs Claimed GPU performance in terms of GFLOP/s

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
I posit that the advertised SP/DP "GFLOP/s" figures for GPUs have no basis in real-world performance.

Context: the GK208-based GT 730 2GB GDDR5 has a claimed 692.7/28.9 GFLOP/s SP/DP rating. The AIDA64 OpenCL GPGPU benchmark reports ~600/29 GFLOP/s: somewhat lower SP, but the advertised DP.

Incoming real-world FP32 test: E@H Gamma-ray Pulsar Binary Search (FGRPopencl1K-nvidia), estimated at 525,000 GFLOP per task. Relevant quote:

About warnings in the logs.

Since BOINC does not report FP64 support, a dummy kernel compile check using FP64 is performed when the OpenCL device is opened. If FP64 is OK, we use the GPU for almost everything (even sorting results). If the device does not support FP64, all kernels requiring "double" support are run on the CPU (about 10x slower).

If you see "OpenCL device has FP64 support" in the logs, it means the GPU has been recognized as supporting double-precision floating point. Don't worry about performance; double precision is not the major part of the processing.

On OSX, there are lots of warnings when compiling the FFT library, but they are harmless and should be ignored.

As Bernd said, we are still having issues with the Windows driver. I hope we will find soon what's causing the biggest OpenCL kernel to fail on Windows only.

Christophe

Time to completion on the GT 730: ~4 hours.

Assuming all of the computation is carried out on the GPU, which probably isn't true because one CPU core is always loaded, the actual performance is

525000/(4*3600) = 36.5 GFLOP/s FP32.

Basically, whenever you run a real-world application, the obtained performance is nowhere near the numbers claimed by the manufacturers.
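As a quick Python sanity check of the arithmetic above (the 525,000 GFLOP figure is the project's own task estimate; the 4-hour runtime is the observed time to completion):

```python
task_gflop = 525_000   # E@H FGRPopencl1K-nvidia task estimate, in GFLOP
runtime_s = 4 * 3600   # ~4 hours observed on the GT 730

effective_gflops = task_gflop / runtime_s
print(round(effective_gflops, 1))  # 36.5 GFLOP/s, vs. 692.7 advertised FP32
```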
 

Krteq

Senior member
May 22, 2015
993
672
136
Well, those GFLOP/s figures stated by IHVs are simply calculated maximum theoretical performance numbers.

You can't have 100% GPU occupancy in the real world.
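For reference, a sketch of how those theoretical numbers are usually derived: shader count × clock × 2, since an FMA counts as two FLOPs. The 384 cores / 902 MHz inputs below are the GK208 GT 730 figures implied by the 692.7 GFLOP/s rating quoted earlier in the thread:

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak: every ALU retires one FMA (2 FLOPs) each clock."""
    return cores * clock_mhz * flops_per_clock / 1000.0

print(peak_gflops(384, 902))  # GK208 GT 730: ~692.7 GFLOP/s FP32
```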
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
I think even the theoretical performance numbers are dubious, because such performance cannot be sustained given the memory bandwidth a GPU has, which is typically a few hundred GB/s. In other words, the established convention of counting 1 FMA op per GPU "core" per clock to report GFLOP/s is merely a convenient way of making GPUs look better than they actually are compared to CPUs.
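One way to make the bandwidth argument concrete is a naive roofline estimate: attainable throughput is the minimum of the compute peak and bandwidth × arithmetic intensity. The numbers below are illustrative assumptions (a ~40 GB/s GDDR5 card, a kernel streaming one FMA per 12 bytes from VRAM); real kernels that reuse data in registers and caches land well above this floor:

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float,
                    intensity_flop_per_byte: float) -> float:
    """Attainable GFLOP/s = min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# Streaming FMA from VRAM: 2 FLOPs per (2 reads + 1 write) * 4 bytes = 1/6 FLOP/byte
print(roofline_gflops(692.7, 40.0, 2 / 12))  # ~6.7 GFLOP/s if nothing is cached
```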
 

Krteq

Senior member
May 22, 2015
993
672
136
Yeah, that equation relates to the ALUs only. You will always be limited by memory bandwidth, etc.
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,979
126
I think even the theoretical performance numbers are dubious, because such performance cannot be sustained given the memory bandwidth a GPU has, which is typically a few hundred GB/s.
In theory you could craft a small piece of code that does nothing except increment a single value using all of the ALUs, with that value sitting in a register so it doesn't touch any other units or any memory bandwidth.

Of course such code is useless in the real world, but then theoretical specs (e.g. MIPS) have always been rather useless for predicting the performance of real code, especially on modern CPUs/GPUs. With GPUs in particular, the driver can have a bigger performance impact than the actual hardware specs.
 

Deders

Platinum Member
Oct 14, 2012
2,401
1
91
The Volta Titan has been designed to make better use of scheduling for these kinds of tasks so it can maintain higher efficiency, much closer to the theoretical throughput.
 

Krteq

Senior member
May 22, 2015
993
672
136
Well, in fact they are going back to a Fermi-style scheduling approach, relying more on the HW scheduler instead of the driver/compiler. Vega also has many changes in its ACEs/HWSs to achieve better thread occupancy, etc.

IHVs are continuously trying to improve the schedulers in their uarchs.

Anyway, the front end is only one part of the problem; there are many other bottlenecks: back end, caches, memory subsystem bandwidth, etc.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
The Volta Titan has been designed to make better use of scheduling for these kinds of tasks so it can maintain higher efficiency, much closer to the theoretical throughput.
Kepler reaches around 1/20th of its supposed 'peak' performance in this workload. I'd be surprised if Volta does better than 1/10th.

To reach even half of what the Titan V can do, i.e. 6-7 TFLOP/s, you need over 3 TB/s of memory bandwidth.

As long as there isn't a breakthrough in memory technology, no GPU will ever come close to what they supposedly can do.
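The 3 TB/s figure can be checked with a back-of-the-envelope estimate: sustaining F FLOP/s at an arithmetic intensity of I FLOP/byte requires F/I bytes/s from memory. The ~2 FLOP/byte intensity below is an illustrative assumption, not a measured value:

```python
def required_bandwidth_tbs(tflops: float, intensity_flop_per_byte: float) -> float:
    """Memory traffic needed so the ALUs, not DRAM, set the ceiling."""
    return tflops / intensity_flop_per_byte

# Half of Titan V's FP32 throughput, at an assumed ~2 FLOP/byte:
print(required_bandwidth_tbs(6.5, 2.0))  # 3.25 TB/s
```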
 

geoxile

Senior member
Sep 23, 2014
327
25
91
Kepler reaches around 1/20th of its supposed 'peak' performance in this workload. I'd be surprised if Volta does better than 1/10th.

To reach even half of what the Titan V can do, i.e. 6-7 TFLOP/s, you need over 3 TB/s of memory bandwidth.

As long as there isn't a breakthrough in memory technology, no GPU will ever come close to what they supposedly can do.
Simple, we just stack HBM even higher.
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
Kepler reaches around 1/20th of its supposed 'peak' performance in this workload.

If Kepler can only reach 1/20th of its theoretical performance, then it isn't really an FP32 test, it's a bandwidth test (or whatever else may be bottlenecking it), and as such it is useless for making any inferences about the FP32 capabilities of Kepler.
 
Reactions: xpea and Muhammed

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
If Kepler can only reach 1/20th of its theoretical performance, then it isn't really an FP32 test, it's a bandwidth test (or whatever else may be bottlenecking it), and as such it is useless for making any inferences about the FP32 capabilities of Kepler.
Every GPGPU workload is a bandwidth test if the data you're sending over to the GPU is to be useful in any sense.

I mean, sure, you can load up the memory subsystem with a useless benchmark that reports X teraflop/s, but that tells you nothing about what the GPU can actually do.

The real performance of a GPU is always going to be constrained by memory bandwidth.

You might have a massive GPU die decked out with a zillion little ALUs, but they're going to do nothing unless your memory is up to the task.

It's much harder to increase memory bandwidth than raw processing power.
 
Reactions: Krteq

Muhammed

Senior member
Jul 8, 2009
453
199
116
This isn't the '60s. Even a Casio calculator can solve Ax = b for a 3x3 system in less than a second.
And a top-end GPU does 14 trillion such ops per second, serving graphics, AI, and some general-purpose compute. They don't need more than that. Obviously, if your calculations are more complex, the throughput will be reduced.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
And a top-end GPU does 14 trillion such ops per second, serving graphics, AI, and some general-purpose compute. They don't need more than that. Obviously, if your calculations are more complex, the throughput will be reduced.
Yes, any realistic workload involves more than just multiplying two numbers and storing the result in a variable. GPUs boast such numbers to lead people into believing that they're orders of magnitude better than CPUs, when actual performance isn't that high compared to a CPU running optimized code.
 
Reactions: Krteq

BFG10K

Lifer
Aug 14, 2000
22,709
2,979
126
GPUs boast such numbers to lead people into believing that they're orders of magnitude better than CPUs, when actual performance isn't that high compared to a CPU running optimized code.
Eh? It's nothing to do with boasting and everything to do with real-world performance. FLOPS/MIPS/SPEC scores are meaningless BS.

If you tried any modern AAA graphics workload on a software rasterizer you'd be getting one frame an hour, if even that. Even 20+ years ago, games like GLQuake/Unreal were far faster on video cards than with CPU rendering, and looked much better as well. These days the gap is vastly larger, probably in line with Moore's law.
 
Reactions: Muhammed and xpea

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Eh? It's nothing to do with boasting and everything to do with real-world performance. FLOPS/MIPS/SPEC scores are meaningless BS.

If you tried any modern AAA graphics workload on a software rasterizer you'd be getting one frame an hour, if even that. Even 20+ years ago, games like GLQuake/Unreal were far faster on video cards than with CPU rendering, and looked much better as well. These days the gap is vastly larger, probably in line with Moore's law.
I'm not sure graphics and compute are directly comparable. As for SPEC, the score is calculated against a baseline, and even though the benchmarks can be fine-tuned, they're all derived from real-world code.

As far as CPUs are concerned, SPEC is a far better benchmark than Cinebench.
 
May 11, 2008
20,040
1,287
126

I think what we need to keep in mind is that there is a difference between peak performance and sustained performance.

A small part of the code that is called often can be optimized to use only the local registers and will come close to the theoretical numbers, but all other code will not, even with smart prefetching of data. If we plotted throughput over time, the line would jump up and down constantly.
Not an issue for the initiated; it is still all extremely fast because it is highly parallel.
 

Bouowmx

Golden Member
Nov 13, 2016
1,139
550
146
Do BOINC projects set the number of FLOPs in a task correctly?

PrimeGrid PPS sieve (number sieve): estimated 49 238 GFLOP
Hardware used: MSI GeForce GT 730 GK208, 1202 MHz (overclocked), 923 GFLOPS

The task appears to have almost no need for memory bandwidth, so I set the memory frequency to the minimum.

Results: https://www.primegrid.com/results.php?hostid=529108&state=4&offset=0
Time to complete: 4600 s. However, I configured the BOINC client to run two tasks simultaneously. See images, with the "Resources" row saying "0.5 NVIDIA GPUs". Doing so improves efficiency: running two simultaneous tasks does not double their run times.
Effective time per task: 2300 s.

(49 238 GFLOP)/(2300 s) = 21.41 GFLOPS
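The same effective-throughput arithmetic as earlier in the thread, as a quick check (taking the project's estimate as GFLOP):

```python
task_gflop = 49_238          # PrimeGrid PPS sieve task estimate
effective_time_s = 4600 / 2  # two tasks run concurrently, so halve the wall time

effective_gflops = task_gflop / effective_time_s
print(round(effective_gflops, 2))  # 21.41, vs. the 923 GFLOPS theoretical peak
```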

 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I think that even the theoretical performance numbers are dubious because such performance cannot be sustained given the kind of memory bandwidth a GPU has, which is typically a few hundred GB/s.

You're neglecting the registers and the cache. Internal bandwidth in a modern high end GPU is easily in the TB/s range.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,447
10,117
126
I think even the theoretical performance numbers are dubious, because such performance cannot be sustained given the memory bandwidth a GPU has, which is typically a few hundred GB/s. In other words, the established convention of counting 1 FMA op per GPU "core" per clock to report GFLOP/s is merely a convenient way of making GPUs look better than they actually are compared to CPUs.
In other news, WiFi is advertised by the sum total of the theoretical channel bandwidth across all channels and frequencies it simultaneously supports. News at 11.

IOW, tech companies advertise theoretical bandwidth all the time.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
You're neglecting the registers and the cache. Internal bandwidth in a modern high end GPU is easily in the TB/s range.
Dataset size in a typical GPGPU application is usually kept limited so as not to exceed the VRAM. Register size is not very relevant in determining overall performance.
In other news, WiFi is advertised by the sum total of the theoretical channel bandwidth across all channels and frequencies it simultaneously supports. News at 11.

IOW, tech companies advertise theoretical bandwidth all the time.
Which is fine, as long as GPU makers refrain from claiming bogus 100x speedups vs. CPUs, which is what they always tend to do.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Do BOINC projects set the number of FLOPs in a task correctly?

PrimeGrid PPS sieve (number sieve): estimated 49 238 GFLOP
Hardware used: MSI GeForce GT 730 GK208, 1202 MHz (overclocked), 923 GFLOPS

The task appears to have almost no need for memory bandwidth, so I set the memory frequency to the minimum.

Results: https://www.primegrid.com/results.php?hostid=529108&state=4&offset=0
Time to complete: 4600 s. However, I configured the BOINC client to run two tasks simultaneously. See images, with the "Resources" row saying "0.5 NVIDIA GPUs". Doing so improves efficiency: running two simultaneous tasks does not double their run times.
Effective time per task: 2300 s.

(49 238 GFLOP)/(2300 s) = 21.41 GFLOPS

You seem to have a very high core overclock; my card starts causing game crashes if I exceed 1100 MHz. Is that task single or double precision?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,447
10,117
126
Which is fine, as long as GPU makers refrain from claiming bogus 100x speedups vs. CPUs, which is what they always tend to do.
"bogus"? Hyperbole much? For DC projects like F@H, the ratio of PPD between CPU tasks on a modern Intel AVX-enabled quad-core and a modern NV Pascal-based GPU is indeed on the order of 100x (or more!) in favor of the GPU.

You seem to want to paint a picture of GPUs being only marginally faster than CPUs, but that's not accurate; they really are an order of magnitude (or two!) speedier. Workload permitting, of course. You might just have a poor workload for them.
 