AVX question

Anarchist420 · Mar 3, 2012

Are CPUs with the AVX instruction set capable of double-precision floating point, or just single? If they are capable of double precision, what is the performance penalty compared to FP32, if any?

Abwx · Mar 3, 2012

Double precision is supported, and AVX also has double the throughput of regular SSE2, although it's unlikely that a program can use only AVX.

Edrick · Mar 3, 2012

I thought all CPUs were measured (FLOPS) in double precision.

TuxDave · Mar 3, 2012

Edrick said:
I thought all CPUs were measured (FLOPS) in double precision.

Heh, I would've guessed SP so you get a higher number.

mustard010 · Mar 3, 2012

Anarchist420 said:
Are CPUs with the AVX instruction set capable of double-precision floating point, or just single? If they are capable of double precision, what is the performance penalty compared to FP32, if any?

AVX instructions operate on 256-bit (32-byte) wide registers, with a total of 16 registers in 64-bit mode. They can perform either single-precision or double-precision operations, depending on the application. With the 32-byte width, one register holds up to 8 floats or 4 doubles.

You can then issue one instruction that performs a calculation on all 8 floats, or one that performs a calculation on all 4 doubles. This is known as vectorization (i.e., single instruction, multiple data, or SIMD).
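As a rough illustration of the lane counts described here, this is a minimal Python sketch (not real AVX code) that models a 256-bit register as a list of lanes and applies one "instruction" to all of them at once. The helper names `lanes_per_register` and `vector_add` are made up for this example.

```python
REGISTER_BITS = 256  # width of one AVX register, per the post above


def lanes_per_register(element_bits, register_bits=REGISTER_BITS):
    """Return how many elements of the given width fit in one register."""
    return register_bits // element_bits


def vector_add(a, b):
    """Model one SIMD add: every lane is added in a single step."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


float_lanes = lanes_per_register(32)   # single precision
double_lanes = lanes_per_register(64)  # double precision

print("floats per register:", float_lanes)
print("doubles per register:", double_lanes)
print(vector_add([1.0] * double_lanes, [2.0] * double_lanes))
```

The same modeled instruction touches twice as many values when the elements are 32-bit, which is where the SP advantage comes from.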

Generally, single-precision operations are faster than their double-precision counterparts. I'm sure someone else who is versed on the subject can explain better... From my own perspective, more bits == more work. Obviously, there are more bits of precision in the IEEE double format, so you would generally get better precision using doubles. Note, though, that rounding is inherent in floating point and is inevitable.
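To make the precision point concrete, here is a small Python sketch that rounds a value to IEEE single precision and back using the standard `struct` module. Python's own `float` is an IEEE double, so the comparison is single versus double. The helper name `to_float32` is made up for this example.

```python
import struct


def to_float32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 2**24 + 1 needs 25 significant bits: a double holds it, a single cannot.
n = float(2**24 + 1)
print("as double:", n)
print("as single:", to_float32(n))

# 0.1 is not exact in either format; the single is simply further off.
print("single-precision error for 0.1:", abs(to_float32(0.1) - 0.1))
```

Both formats round, as noted above; the double just has more significand bits (53 versus 24) before rounding kicks in.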

Edrick · Mar 3, 2012

TuxDave said:
Heh, I would've guessed SP so you get a higher number.

Single precision is very limited. It works OK for GPUs because their role is also limited. Most of the new GPUs that are being used for other tasks, like Tesla, have increased double-precision performance.

CPUs need to be well rounded to handle all tasks. So measuring CPUs in SP would be a very useless number in my opinion. I could be wrong however.

mustard010 · Mar 3, 2012

Edrick said:
Single precision is very limited. It works OK for GPUs because their role is also limited. Most of the new GPUs that are being used for other tasks, like Tesla, have increased double-precision performance.

CPUs need to be well rounded to handle all tasks. So measuring CPUs in SP would be a very useless number in my opinion. I could be wrong however.

They have two different FLOPS counts, one for SP and another for DP. It's not really an apples-to-apples comparison when one FLOPS count has 32 more bits to deal with.

TuxDave · Mar 3, 2012

Edrick said:
Single precision is very limited. It works OK for GPUs because their role is also limited. Most of the new GPUs that are being used for other tasks, like Tesla, have increased double-precision performance.

CPUs need to be well rounded to handle all tasks. So measuring CPUs in SP would be a very useless number in my opinion. I could be wrong however.

I was mostly referring to what goes on marketing slides.

Edrick · Mar 3, 2012

mustard010 said:
They have two different FLOPS counts, one for SP and another for DP. It's not really an apples-to-apples comparison when one FLOPS count has 32 more bits to deal with.

I realize that. And we see that all the time with GPUs. But rarely, if ever, do I see different numbers released for CPUs.

mustard010 · Mar 3, 2012

Edrick said:
I realize that. And we see that all the time with GPUs. But rarely, if ever, do I see different numbers released for CPUs.

Probably because traditional CPUs are slow w.r.t. GFLOPS ratings. GPUs were designed with many stamped-out vector processors, right? I wouldn't be surprised if CPUs start to advertise SP and DP in the near future with the fused CPU/GPU architectures.

Edrick · Mar 3, 2012

mustard010 said:
I wouldn't be surprised if CPUs start to advertise SP and DP in the near future with the fused CPU/GPU architectures.

Perhaps. I just don't find the SP number important on the CPU side. Hell, even some DC applications (milkyway@home, for example) no longer work on SP-only GPUs.

TuxDave · Mar 3, 2012

Edrick said:
I realize that. And we see that all the time with GPUs. But rarely, if ever, do I see different numbers released for CPUs.

Well, if you can find a slide that posts the peak theoretical FLOPS, we can easily do the math to figure out if it's SP or DP.
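The math could look something like the Python sketch below. It assumes a core that can issue one 256-bit add and one 256-bit multiply per cycle (2 vector operations per cycle); that figure, the 4-core 3.0 GHz chip, and the function names `peak_gflops` and `guess_precision` are all assumptions for illustration, not numbers taken from any slide.

```python
def peak_gflops(cores, ghz, element_bits, register_bits=256, vector_ops_per_cycle=2):
    """Peak theoretical GFLOPS = cores x clock x lanes x vector ops per cycle."""
    lanes = register_bits // element_bits
    return cores * ghz * lanes * vector_ops_per_cycle


def guess_precision(slide_gflops, cores, ghz):
    """Say whether a quoted peak figure is closer to the SP or the DP peak."""
    sp = peak_gflops(cores, ghz, 32)
    dp = peak_gflops(cores, ghz, 64)
    return "SP" if abs(slide_gflops - sp) < abs(slide_gflops - dp) else "DP"


# Hypothetical chip: 4 cores at 3.0 GHz.
print("DP peak:", peak_gflops(4, 3.0, 64), "GFLOPS")
print("SP peak:", peak_gflops(4, 3.0, 32), "GFLOPS")
print("a slide quoting 192 GFLOPS is probably:", guess_precision(192.0, 4, 3.0))
```

If the quoted number matches neither peak, the assumed operations-per-cycle figure is the first thing to revisit.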

TuxDave · Mar 3, 2012

mustard010 said:
AVX instructions operate on 256-bit (32-byte) wide registers, with a total of 16 registers in 64-bit mode. They can perform either single-precision or double-precision operations, depending on the application. With the 32-byte width, one register holds up to 8 floats or 4 doubles.

You can then issue one instruction that performs a calculation on all 8 floats, or one that performs a calculation on all 4 doubles. This is known as vectorization (i.e., single instruction, multiple data, or SIMD).

Generally, single-precision operations are faster than their double-precision counterparts. I'm sure someone else who is versed on the subject can explain better... From my own perspective, more bits == more work. Obviously, there are more bits of precision in the IEEE double format, so you would generally get better precision using doubles. Note, though, that rounding is inherent in floating point and is inevitable.

From what I see, SP and DP instructions are typically the same latency (with the exception of SQRT and DIV). It's just that you have more packed SP vectors than DP and so SP throughput SHOULD be twice as high.
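A back-of-the-envelope Python sketch of that argument: if each packed instruction costs the same regardless of element width, the only difference is how many instructions are needed to cover the data. The function name `vector_instructions` and the array size are made up for this example, and loads, stores, and loop overhead are ignored.

```python
import math


def vector_instructions(n_elements, element_bits, register_bits=256):
    """Number of packed instructions needed to touch every element once."""
    lanes = register_bits // element_bits
    return math.ceil(n_elements / lanes)


n = 1024  # hypothetical array length
sp_count = vector_instructions(n, 32)
dp_count = vector_instructions(n, 64)

print("SP instructions:", sp_count)
print("DP instructions:", dp_count)
print("DP/SP ratio:", dp_count / sp_count)
```

With equal per-instruction latency, the DP pass issues twice as many instructions, hence the expected 2x SP throughput.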

mustard010 · Mar 3, 2012

TuxDave said:
From what I see, SP and DP instructions are typically the same latency (with the exception of SQRT and DIV). It's just that you have more packed SP vectors than DP and so SP throughput SHOULD be twice as high.

Not always twice the performance between SP and DP, no? I originally thought that this was the case, but looking at NVIDIA GPUs, DP is very suboptimal compared to SP. Not just half the performance... maybe on the order of a fifth.

AtenRa · Mar 3, 2012

Anarchist420 said:
Are CPUs with the AVX instruction set capable of double-precision floating point, or just single?

All current CPUs from AMD and Intel support 64-bit (double-precision) floating point, even with the legacy scalar instructions.

With SIMD instructions (SSE), the registers are 128 bits wide, which is 2x 64-bit or 4x 32-bit, etc.

AVX instructions are 256-bit, which means the FPU can execute 4x 64-bit or 8x 32-bit operations at once, etc.

Anarchist420 said:
If they are capable of double precision, what is the performance penalty compared to FP32, if any?

Calculating in 64-bit produces a higher-precision result than 32-bit, but it takes longer to execute the same instruction.

IntelUser2000 · Mar 3, 2012

In CPUs, single-precision throughput is 2x double-precision throughput.

In the case of GPUs, Tesla GPUs achieve the same ratio: SP = 2x DP.

TuxDave · Mar 3, 2012

mustard010 said:
Not always twice the performance between SP and DP, no? I originally thought that this was the case, but looking at NVIDIA GPUs, DP is very suboptimal compared to SP. Not just half the performance... maybe on the order of a fifth.

Just looking at the numbers, an FP add will take the same # of cycles whether it's 8 SP values or 4 DP values. Any additional performance delta beyond that is due to secondary effects, for example packing and unpacking a series of unpacked FP numbers prior to the FP add. So you may be correct that the performance delta between SP and DP is more than 5x, but it would be due to secondary effects (loads, stores, instruction flows) and not the basic level of how long it takes to do an ADD or MUL.

mustard010 · Mar 4, 2012

TuxDave said:
Just looking at the numbers, an FP add will take the same # of cycles whether it's 8 SP values or 4 DP values. Any additional performance delta beyond that is due to secondary effects, for example packing and unpacking a series of unpacked FP numbers prior to the FP add. So you may be correct that the performance delta between SP and DP is more than 5x, but it would be due to secondary effects (loads, stores, instruction flows) and not the basic level of how long it takes to do an ADD or MUL.

Thanks for clarifying. After posting an earlier post, I immediately checked the Tesla SP vs. DP performance, and lo and behold SP was indeed 2x as fast as DP. Then again, this is just raw FLOPS and doesn't take into account memory movement as you say.
