- Jan 5, 2017
I posit that the advertised SP/DP "GFLOP/s" figures for GPUs have no basis in real-world performance.
Context: the GK208-based GT 730 2GB GDDR5 has a claimed 692.7/28.9 GFLOP/s SP/DP performance. The AIDA64 OpenCL GPGPU benchmark reports ~600/29 GFLOP/s: somewhat lower SP, but DP matches the advertised figure.
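For reference, the advertised figures look like theoretical peaks rather than measurements. A minimal sketch, assuming the commonly listed GT 730 GDDR5 specs of 384 CUDA cores at 902 MHz and a 1/24 DP rate for GK208 (none of these are stated in this post), and counting one fused multiply-add as 2 FLOPs per core per cycle:

```python
# Assumed specs (not from this post): 384 CUDA cores, 902 MHz, DP at 1/24 of SP.
CORES = 384
CLOCK_GHZ = 0.902
FLOPS_PER_CORE_PER_CYCLE = 2   # one fused multiply-add counts as 2 FLOPs
DP_RATIO = 1 / 24


def peak_gflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CORE_PER_CYCLE):
    """Theoretical peak in GFLOP/s: every core issues an FMA on every cycle."""
    return cores * clock_ghz * flops_per_cycle


sp_peak = peak_gflops(CORES, CLOCK_GHZ)
dp_peak = sp_peak * DP_RATIO
print(f"SP peak: {sp_peak:.1f} GFLOP/s, DP peak: {dp_peak:.1f} GFLOP/s")
```

Under those assumptions this reproduces the 692.7/28.9 figures, which would mean the claimed numbers are an upper bound that only a kernel issuing an FMA on every core every cycle could reach.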
Incoming real-world FP32 test: E@H Gamma-ray Pulsar Binary Search (FGRPopencl1K-nvidia), estimated to be 525,000 GFLOPs in total (an operation count, not a rate). Relevant quote:
About warnings in the logs.
Since BOINC does not report FP64 support, a dummy kernel compile check using FP64 is performed when the OpenCL device is opened. If FP64 is OK, we use the GPU for almost everything (even sorting results). If the device does not support FP64, all kernels requiring "double" support are performed by the CPU (about 10x slower).
If you see "OpenCL device has FP64 support" in the logs, it means that the GPU has been recognized to support double floating point. Don't worry about performance, double precision is not the major part of processing.
On OSX, there are lots of warnings when compiling the FFT library, but these are harmless and should be ignored.
As Bernd said, we are still having issues with the Windows driver. I hope we will soon find what's causing the biggest OpenCL kernel to fail on Windows only.
Christophe
Time to completion on the GT 730: approx. 4 hrs.
Assuming all of the computation is carried out on the GPU, which probably isn't true because one CPU core is always loaded, actual performance is
525000/(4*3600) = 36.5 GFLOP/s FP32.
That is roughly 5% of the advertised 692.7 GFLOP/s. Basically, whenever you run a real-world application, the obtained performance is nowhere near the numbers claimed by the manufacturers.
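The arithmetic above can be redone as a short Python sketch so other runtimes or task sizes can be plugged in. The function and variable names are my own, and it keeps the same simplifying assumption that the GPU does all of the work:

```python
def effective_gflops(total_gflop, hours):
    """Average throughput in GFLOP/s for a task of total_gflop operations."""
    return total_gflop / (hours * 3600)


TASK_GFLOP = 525_000        # estimated task size quoted above
RUNTIME_HOURS = 4           # approximate completion time on the GT 730
CLAIMED_SP_GFLOPS = 692.7   # advertised SP figure

achieved = effective_gflops(TASK_GFLOP, RUNTIME_HOURS)
fraction = achieved / CLAIMED_SP_GFLOPS
print(f"{achieved:.1f} GFLOP/s achieved, {fraction:.1%} of the advertised figure")
```

Keep in mind that the 525,000 GFLOPs task size is itself an estimate, so the result is only as accurate as that figure and the rough 4 hr runtime.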