CPU architecture: getting rid of the FPU

thebcomputerguy

Junior Member
Oct 6, 2017
7
0
1
I've been thinking about CPU architecture.

Why couldn't we get rid of the FPU and use a larger ALU instead, for example a 512-bit ALU that replaces the FPU and does pure integer math?

Do we really need more than 512 bits?
 

thebcomputerguy

Junior Member
Oct 6, 2017
7
0
1
Because many programs, even lots of games, depend on the FPU now.
Any float, double, or long can be mapped to an integer value given enough bits.

Sure, such a large ALU would require more power, but the FPU on most CPUs uses around 4x the power of the integer side, plus a whole lot of additional complexity.

For the reduction in complexity alone, I think this could be a worthwhile avenue to take a look at.
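As a rough illustration of the "any float can be mapped to an integer given enough bits" point: every finite IEEE-754 double is a dyadic rational, so a wide enough fixed-point integer can hold it exactly. A minimal Python sketch, assuming a hypothetical Q256.256 layout (256 integer bits, 256 fractional bits; doubles down near the subnormal range would need even more fractional bits):

```python
from fractions import Fraction

FRAC_BITS = 256  # fractional bits in the assumed Q256.256 format

def double_to_fixed(x: float) -> int:
    # A finite double is m * 2**e with a 53-bit m; as long as e >= -FRAC_BITS
    # the scaled value below is an exact integer (tiny subnormals are not).
    return int(Fraction(x) * (1 << FRAC_BITS))

def fixed_to_double(n: int) -> float:
    return float(Fraction(n, 1 << FRAC_BITS))

x = 3.141592653589793
n = double_to_fixed(x)
print(fixed_to_double(n) == x)  # True: the mapping round-trips exactly
print(n.bit_length())           # 258 -- comfortably inside 512 bits
```

This only shows the mapping exists; as the replies below argue, carrying 512 bits around for every operation is the expensive part.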
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,223
136
I don't want to say that they got rid of the FPU completely, but they wanted to reduce dependency on the FPU and rely more on the integer unit.
Not exactly true. The big performance drop, what actually caused performance to tank, was the general-purpose cores.

The FPU is what is mostly carrying the deficiencies in the general-purpose cores. Essentially, the Bulldozer-Excavator cores were no better than a more optimized and higher-clocked Bobcat-Jaguar core, while the FPU is vastly better than what is in the Zen core. AMD also had a couple of years to improve the FPU's inefficiencies. They even had a more optimized FMAC in test chips, which could in fact split and is still more efficient than bridging FMUL and FADD, with lower dependency latency.

They could have easily fixed the Bulldozer general-purpose side by actually fully utilizing the Alpha 21264 core, which had FOUR ALUs, not the mostly two ALUs and the completely gimped AGLUs. Essentially, AMD Bulldozer had a vastly improved Alpha 21264 front-end, two gimped Alpha 21264 general-purpose cores, and a vastly improved Alpha 21264 FPU (2 Alpha 21264 FPUs in FMAC + 2 FMISC from Stars).

It clearly sucks that AMD didn't allow Bulldozer designs to use the 2009-2012 improvements that were tested internally.
I've been thinking about CPU architecture.

Why couldn't we get rid of the FPU and use a larger ALU instead, for example a 512-bit ALU that replaces the FPU and does pure integer math?

Do we really need more than 512 bits?
Different tasks, thus different optimizations. Physically, larger units mean more area, more power, and lower frequency.

The FPU is actually a bunch of 32-bit and 64-bit units running a single instruction. So size isn't really the problem; moving all that data at once is. Load/store is super energy-intensive, so a few large loads/stores cost about the same as many small ones. Also, just because it is called an FPU doesn't mean it isn't integer hardware underneath.
 
Last edited:

thebcomputerguy

Junior Member
Oct 6, 2017
7
0
1
Not exactly true. The big performance drop, what actually caused performance to tank, was the general-purpose cores.

The FPU is what is mostly carrying the deficiencies in the general-purpose cores. Essentially, the Bulldozer-Excavator cores were no better than a more optimized and higher-clocked Bobcat-Jaguar core, while the FPU is vastly better than what is in the Zen core. AMD also had a couple of years to improve the FPU's inefficiencies. They even had a more optimized FMAC in test chips, which could in fact split and is still more efficient than bridging FMUL and FADD, with lower dependency latency.

They could have easily fixed the Bulldozer general-purpose side by actually fully utilizing the Alpha 21264 core, which had FOUR ALUs, not the mostly two ALUs and the completely gimped AGLUs. Essentially, AMD Bulldozer had a vastly improved Alpha 21264 front-end, two gimped Alpha 21264 general-purpose cores, and a vastly improved Alpha 21264 FPU (2 Alpha 21264 FPUs in FMAC + 2 FMISC from Stars).

It clearly sucks that AMD didn't allow Bulldozer designs to use the 2009-2012 improvements that were tested internally.
Different tasks, thus different optimizations. Physically, larger units mean more area, more power, and lower frequency.

The FPU is actually a bunch of 32-bit and 64-bit units running a single instruction. So size isn't really the problem; moving all that data at once is. Load/store is super energy-intensive, so a few large loads/stores cost about the same as many small ones. Also, just because it is called an FPU doesn't mean it isn't integer hardware underneath.

Lots of great info at the top to dig into; I didn't know that before, so I will take my time with it.

I know that FPU math is just integer math in the end. That's why I ask: why not have a sufficiently large ALU, say one that can handle two 1024-bit numbers? If you are doing floating point that requires that much precision, it can be done in two loads, one for each value, instead of the constant load/store traffic plus the associated memory-bandwidth issues, power consumption, and latency, PLUS the additional complexity.

If you are doing math that does not need such high precision, multiple operations can be encoded and completed within the 1024 bits.

Why haven't these ideas been thought about?
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
That's why I ask: why not have a sufficiently large ALU, say one that can handle two 1024-bit numbers?

ALUs only handle the completion of arithmetic ops involving integer numbers; the computation is actually done in the FPU unless the magnitude of the numbers is small enough, in which case the ALU has local units at its disposal. The ALU retains direct execution of everything around those math ops, that is, boolean ops, branches, and all kinds of bit manipulation, as well as the completion of all FP ops.
 

UncleCrusty

Junior Member
Jul 25, 2016
22
6
51
512-bit arithmetic would be very expensive to implement. An adder that large would have a latency of at least 2 clock cycles, and a fully pipelined multiplier would be roughly 64 times as large as a comparable 64-bit multiplier. With double-precision floating point you get 4x the "range" of 512-bit fixed point at a high degree of precision; higher precision is seldom needed.
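The "64 times as large" figure follows from the quadratic growth of an array (schoolbook) multiplier, which needs roughly one partial-product cell per pair of input bits. A back-of-the-envelope sketch:

```python
# An array multiplier needs about one AND gate / adder cell per bit pair,
# so its area grows with the square of the operand width.
def partial_product_cells(width_bits: int) -> int:
    return width_bits * width_bits

ratio = partial_product_cells(512) // partial_product_cells(64)
print(ratio)  # 64 -- a 512-bit array multiplier vs a 64-bit one
```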
 

TheGiant

Senior member
Jun 12, 2017
748
353
106
I am really looking forward to getting my Intel 8700 SX!
And the second socket with the fully activated FPU DX processor!

Moar coolers, bigger boards, bigger cases, higher power draw...

Money for everyone! Well, except the consumer, but that's not so important...
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I've been thinking about CPU architecture.

Why couldn't we get rid of the FPU and use a larger ALU instead, for example a 512-bit ALU that replaces the FPU and does pure integer math?

Do we really need more than 512 bits?

If you use a larger ALU in place of the FPU, does the extra power use go away? The only reason 256/512-bit vector FPUs use so much power is that they are that high-performance. If you made an ALU to replace those wide FPUs, you'd likely end up using a similar amount of power. It's practically the laws of physics at this point.

You would end up even worse off in the power department, because you'd be firing up very large ALUs to do non-FP general-purpose calculations. Power gating is out of the question, because the time it takes to fire the units back up would incur performance penalties.

Never mind the issues with compatibility: the job of software and hardware engineers should be to make computing a black box with an interface that spares the user as many headaches as possible. Removing a unit that's been in use for decades runs contrary to that idea.

As much as what the CPU companies are doing leaves something to be desired, they are still professionals, and thus know much more than those who are not directly involved. If we can't accept that, perhaps it's we ourselves who need to curb those expectations.
 

thebcomputerguy

Junior Member
Oct 6, 2017
7
0
1
I never thought I'd hear people say "calm down, you're thinking too big," but hey.

Power consumption would be high, sure, but complexity goes way down, because instead of a bunch of load/store and overflow checks for the FPU, you can do two loads/stores for up to 512 bits in one go, or that one ALU could work on many 8-, 16-, 32-, 64-, even 128-bit operations in one go.
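The "many small ops in one wide ALU" idea does exist in software as SWAR (SIMD within a register). A minimal Python sketch of four independent 16-bit adds carried out by a single wide integer add; each lane wraps mod 2**16 on its own, and the mask trick keeps carries from leaking between lanes:

```python
# SWAR: four 16-bit lanes packed into one 64-bit integer add.
def packed_add16(x: int, y: int) -> int:
    HIGH = 0x8000_8000_8000_8000       # top bit of each 16-bit lane
    LOW  = 0x7FFF_7FFF_7FFF_7FFF       # the other 15 bits of each lane
    partial = (x & LOW) + (y & LOW)    # add without cross-lane carries
    return partial ^ ((x ^ y) & HIGH)  # restore each lane's top bit via XOR

a = (1 << 48) | (2 << 32) | (3 << 16) | 4
b = (10 << 48) | (20 << 32) | (30 << 16) | 40
s = packed_add16(a, b)
lanes = [(s >> k) & 0xFFFF for k in (0, 16, 32, 48)]
print(lanes)  # [44, 33, 22, 11]
```

Note that this already needs masking logic per lane; hardware SIMD units bake the lane boundaries in, which is one reason dedicated vector units beat a plain wide adder.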

I still don't understand the compatibility issue when talking about floating-point math. It all gets converted to 0s and 1s in the end. Dealing with a floating radix point actually increases complexity a lot. Not to mention that we know increasing clock speeds is pretty much a dead end at this point.

About multiplication and division: any number can be multiplied or divided with just a maximum of 2 shifts, either to the left or right, in the worst case for an odd number, or one shift for an even number.

We can't change the speed of light, sure, but there have to be different ways to tackle the issue of CPU performance; there's only so much faster a CPU can go before it burns a hole in the earth.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
You didn't read the links I provided. At least look up the accumulation of rounding errors.
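For reference, the accumulation of rounding errors mentioned here is easy to demonstrate in a few lines:

```python
# 0.1 has no exact binary representation, so each addition rounds,
# and the tiny per-step errors pile up instead of cancelling.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)         # 0.9999999999999999
print(total == 1.0)  # False
```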
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Why do you think that a larger integer ALU would be more efficient than an FP ALU? It would strictly do more work. Yes, every operation could be computed using wide enough integer, but FP is basically a performance optimization where you decide that you don't need any more than 24/53 bits of precision and discard the rest. Computing values at 53 bits precision is way cheaper and burns way less power than 512 bits precision.

Yes, FPUs are much bigger and hotter than the integer ALUs, but that is not because a single FPU operation is more expensive than the corresponding high-bit integer operation; it's because a lot of typical FP-heavy workloads are data-parallelizable, and therefore, to boost speed, the FPUs have been made to support reasonably wide SIMD. That is, doing a single 64-bit FP operation wouldn't burn that much power, but the FP units in the newest Intel CPUs can do 8 in parallel as a single operation, and that does burn a lot of power. A lot less than doing 8 512-bit integer operations in parallel would, though.
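The 24/53-bit point is easy to verify: a double keeps only 53 significand bits, so the "discarded" precision becomes visible right at 2**53. A quick Python check:

```python
# A double has a 53-bit significand; integers above 2**53 lose exactness.
print(float(2**53) + 1 == float(2**53))      # True: the +1 is rounded away
print(float(2**53 - 1) + 1 == float(2**53))  # True: still exact at the limit
print((2**53 + 1) - 2**53)                   # 1: wide integers do keep it
```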
 
Reactions: FaaR

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
About multiplication and division: any number can be multiplied or divided with just a maximum of 2 shifts, either to the left or right, in the worst case for an odd number, or one shift for an even number.
That's very wrong. Trivial multiplication is quadratic in the length of the operands. You can get better than quadratic, but certainly not linear as you seem to think.
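Concretely, schoolbook multi-precision multiplication does one machine multiply per pair of machine words, which is where the quadratic cost comes from. A minimal Python sketch (the limb width and counts are illustrative):

```python
# Schoolbook multiplication over 64-bit limbs, counting limb multiplies.
def schoolbook_mul(a_limbs, b_limbs, base=2**64):
    result = [0] * (len(a_limbs) + len(b_limbs))
    ops = 0
    for i, a in enumerate(a_limbs):
        carry = 0
        for j, b in enumerate(b_limbs):
            ops += 1                          # one hardware-width multiply
            total = result[i + j] + a * b + carry
            result[i + j] = total % base
            carry = total // base
        result[i + len(b_limbs)] += carry
    return result, ops

# A 512-bit operand is 8 limbs of 64 bits: 8*8 = 64 limb multiplies,
# versus 1 for a single 64-bit multiply. Quadratic, not a couple of shifts.
_, ops = schoolbook_mul([1] * 8, [1] * 8)
print(ops)  # 64
```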
 
Reactions: Phynaz

whm1974

Diamond Member
Jul 24, 2016
9,460
1,570
96
Never mind the issues with compatibility: the job of software and hardware engineers should be to make computing a black box with an interface that spares the user as many headaches as possible. Removing a unit that's been in use for decades runs contrary to that idea.
Not to mention the extra work that would be put on developers to enable backwards compatibility in some form, or emulation.
 

thebcomputerguy

Junior Member
Oct 6, 2017
7
0
1
Why do you think that a larger integer ALU would be more efficient than an FP ALU? It would strictly do more work. Yes, every operation could be computed using wide enough integer, but FP is basically a performance optimization where you decide that you don't need any more than 24/53 bits of precision and discard the rest. Computing values at 53 bits precision is way cheaper and burns way less power than 512 bits precision.

Yes, FPUs are much bigger and hotter than the integer ALUs, but that is not because a single FPU operation is more expensive than the corresponding high-bit integer operation; it's because a lot of typical FP-heavy workloads are data-parallelizable, and therefore, to boost speed, the FPUs have been made to support reasonably wide SIMD. That is, doing a single 64-bit FP operation wouldn't burn that much power, but the FP units in the newest Intel CPUs can do 8 in parallel as a single operation, and that does burn a lot of power. A lot less than doing 8 512-bit integer operations in parallel would, though.

You're missing the point about the numerous loads and stores needed to get this data into the CPU registers. Not only that, but have you actually tried to write a SIMD program? SIMD hasn't taken off because most developers do not know how to reorganize their programs to take advantage of SIMD units. Most SIMD optimization happens at compile time, and unless those developers are writing compilers they don't even know what's happening down there.

Not to mention the extra work that would be put on developers to enable backwards compatibility in some form, or emulation.

This is mostly not true, unless you are talking about compiler writers, in which case I doubt they'd find it challenging to implement.
 

whm1974

Diamond Member
Jul 24, 2016
9,460
1,570
96
This is mostly not true, unless you are talking about compiler writers, in which case I doubt they'd find it challenging to implement.
While I'm not an expert on CPU design, I'm pretty sure removing the FPU and substituting a very wide ALU would require a new ISA, which in turn would require emulation for x86 programs, which would also reduce performance due to overhead.
 

Schmide

Diamond Member
Mar 7, 2002
5,589
724
126
You're missing the point about the numerous loads and stores needed to get this data into the CPU registers.

Yeah, a 64 KB cache can hold only 1024 operands at 512-bit vs 8192 at 64-bit vs 16384 at 32-bit. The bigger the operand, the more cache it fills, and thus the slower it is to load on average.

Edit: did calcs with bit not byte
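The arithmetic above checks out; a quick way to reproduce it:

```python
# Operand counts for a 64 KB data cache at each operand width.
CACHE_BYTES = 64 * 1024
counts = {bits: CACHE_BYTES // (bits // 8) for bits in (32, 64, 512)}
print(counts)  # {32: 16384, 64: 8192, 512: 1024}
```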
 
Last edited:
Reactions: Phynaz

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
You're missing the point about the numerous loads and stores needed to get this data into the CPU registers. Not only that, but have you actually tried to write a SIMD program? SIMD hasn't taken off because most developers do not know how to reorganize their programs to take advantage of SIMD units. Most SIMD optimization happens at compile time, and unless those developers are writing compilers they don't even know what's happening down there.

Just freaking stop. 64-bit Windows requires SIMD. Period. News flash: SIMD isn't just floating point.

Here's a clue - you haven't had some revelation that every computer scientist over the last 60+ years has missed.

This is nothing but a troll thread.

Attacking other posters is not allowed.
Markfw
Anandtech Moderator
 
Last edited by a moderator:

whm1974

Diamond Member
Jul 24, 2016
9,460
1,570
96
While I'm not an expert on CPU design, I'm pretty sure removing the FPU and substituting a very wide ALU would require a new ISA, which in turn would require emulation for x86 programs, which would also reduce performance due to overhead.
I'm now wondering if I'm talking out my rear here, but am I correct on this? Wouldn't a CPU designed with a very wide ALU and no FPU, even if backwards compatible with x86, still need new code to be written to use any advantages this would bring?
 