Are CPUs with the AVX instruction set capable of double-precision floating point, or just single? If they're capable of double precision, what's the performance penalty compared to FP32, if any?
I thought all CPUs were measured (FLOPS) in double precision.
Heh, I would've guessed SP so you get a higher number.
Single precision is very limited. It works OK for GPUs because their role is also limited. Most of the new GPUs that are being used for other tasks, like Tesla, have increased double-precision performance.
CPUs need to be well rounded to handle all tasks, so measuring CPUs in SP would give a fairly useless number, in my opinion. I could be wrong, however.
They have two different FLOPS counts, one for SP and another for DP. It's not really an apples-to-apples comparison when one FLOPS count has 32 more bits per value to deal with.
I realize that. And we see that all the time with GPUs. But rarely, if ever, do I see different numbers released for CPUs.
I wouldn't be surprised if CPUs start to advertise SP and DP in the near future with the fused CPU/GPU architectures.
AVX has 256-bit (32-byte) wide registers, 16 in total. AVX instructions can operate on either single-precision or double-precision values, depending on the application. With the 32-byte width, one register holds up to 8 floats or 4 doubles.
You can then issue one instruction that performs a calculation on all 8 floats, or one that performs a calculation on all 4 doubles. This is known as vectorization (i.e., single instruction, multiple data, or SIMD).
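To make the lane arithmetic concrete, here is a minimal sketch in plain Python. It only models the idea; it does not execute AVX instructions, and the helper names `lanes` and `packed_add` are made up for illustration.

```python
import struct

REGISTER_BYTES = 32  # one 256-bit AVX register


def lanes(fmt):
    """Number of elements of the given struct format that fit in one register."""
    return REGISTER_BYTES // struct.calcsize(fmt)


def packed_add(a, b):
    """Model of one packed add: every lane is summed in a single step."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


sp_lanes = lanes("<f")  # IEEE single precision, 4 bytes per element
dp_lanes = lanes("<d")  # IEEE double precision, 8 bytes per element

print("floats per register: ", sp_lanes)
print("doubles per register:", dp_lanes)

# One modeled "instruction" handles a full register of either type.
print(packed_add([1.0] * sp_lanes, [2.0] * sp_lanes))
print(packed_add([1.0] * dp_lanes, [2.0] * dp_lanes))
```

The same register width carries twice as many single-precision lanes, which is where the SP advantage comes from in the first place.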
Generally, single-precision operations are faster than their double-precision counterparts. I'm sure someone better versed in the subject can explain this better... From my own perspective, more bits == more work. There are more bits of precision in the IEEE double format, so you generally get better precision using doubles. Note, though, that rounding is inherent in floating point and cannot be avoided in either format.
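The precision difference can be seen without any special hardware. The sketch below, in plain Python, rounds a value to single precision by round-tripping it through the standard `struct` module; the helper name `to_float32` is made up for illustration. It relies on the IEEE 754 significand sizes of 24 bits (single) and 53 bits (double), counting the implicit leading bit.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


SP_SIGNIFICAND_BITS = 24  # IEEE 754 single precision
DP_SIGNIFICAND_BITS = 53  # IEEE 754 double precision

# Gap between 1.0 and the next representable value in each format.
sp_eps = 2.0 ** -(SP_SIGNIFICAND_BITS - 1)
dp_eps = 2.0 ** -(DP_SIGNIFICAND_BITS - 1)

x = 0.1  # not exactly representable in either format
err_sp = abs(to_float32(x) - x)  # extra error introduced by rounding to single

print("single epsilon:", sp_eps)
print("double epsilon:", dp_eps)
print("0.1 as single: ", repr(to_float32(x)))
print("extra error:   ", err_sp)
```

Values such as 0.5 survive the round trip unchanged because they are exact in binary; 0.1 does not, in either format, which is the inherent rounding mentioned above.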
From what I see, SP and DP instructions typically have the same latency (with the exception of SQRT and DIV). It's just that a vector packs twice as many SP values as DP values, so SP throughput SHOULD be twice as high.
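As a rough back-of-the-envelope version of that argument, the sketch below computes a theoretical peak from lane count alone. The core count, clock speed, and one-vector-operation-per-cycle rate are hypothetical numbers chosen for illustration, not figures for any real CPU, and `peak_gflops` is a made-up helper.

```python
def peak_gflops(cores, ghz, register_bits, element_bits, vector_ops_per_cycle=1):
    """Theoretical peak in GFLOPS, assuming every cycle issues full-width vector ops."""
    lanes = register_bits // element_bits
    return cores * ghz * lanes * vector_ops_per_cycle


# Hypothetical machine: 4 cores at 3.0 GHz with 256-bit vector registers.
sp = peak_gflops(cores=4, ghz=3.0, register_bits=256, element_bits=32)
dp = peak_gflops(cores=4, ghz=3.0, register_bits=256, element_bits=64)

print("peak SP GFLOPS:", sp)
print("peak DP GFLOPS:", dp)
print("SP/DP ratio:   ", sp / dp)
```

Under these assumptions the ratio is exactly 2, purely because of the lane count. Real code can deviate from that once loads, stores, and instructions like SQRT and DIV enter the picture.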
It's not always twice the performance between SP and DP, is it? I originally thought that was the case, but looking at NVIDIA GPUs, DP is far slower than SP. Not just half the performance... maybe on the order of a fifth.
Just looking at the numbers, an FP add takes the same number of cycles whether it operates on 8 SP values or 4 DP values. Any additional performance delta beyond that is due to secondary effects, for example packing and unpacking a series of unpacked FP numbers prior to the FP add. So you may be correct that the performance delta between SP and DP can be more than 5x, but it would be due to secondary effects (loads, stores, instruction flow) and not to how long a basic ADD or MUL takes.